ENTER: GPT-SoVITS!

"There’s a half-life in the entertainment industry, we all know that. Did you have fun though?" - Conversation between Mistrál and R.I, デラ・ルーの大導劇神


Executive Summary: The very first dry run! I try to train a custom voice model for one of our actresses, but we run into some problems!


#HouseDelaroux, #AIVTuber, #HellsTheatre, #TechnoDramaturgy, #Dramaturgy, #LLMVtuber, #Ollama, #LocalAI, #AICharacterDesign, #VtuberDevelopment, #AICompanions, #OpenLLMVtuber

LAST TIME ON 'HELL'S THEATRE':

We managed to pair up Lumi with an old version of GPT-SoVITS and got her speaking with a Chinese accent. But that's not enough! For our actresses to truly stand out, we will need to train a voice model made specially for each of them!


Dramaturg's Address

One of the things that our stage-director, Miss Delulz, and I agree on is that we want to set up a theatre house that nobody can take away from us. Even if our house burned down, we could simply take everything we learnt and set up a new one. The experiences and memories that we formed here at Theatre House Delaroux will not simply disappear like that!


That's why we put so much focus on making sure that our models don't come from people who dislike AI, and that each of our actresses has a backup saved on a local drive. Theatre House Delaroux relies on outside help as little as possible, apart from the great frameworks and tools that make this project possible!


And of course, (You), who is reading this!

Now, onwards to the main show!


ERROR ERROR ERROR ERROR

I'll save you the hellish Day 1 of trying to pair up a custom voice with Lumi. In the last blog post, we managed to get GPT-SoVITS working with OpenLLM-Vtuber! I was really stoked to start creating voices with Lumi and tried to get Grok to teach me everything, hopefully without reading a manual!


Turns out there's a big problem. (misunderstanding)


I am working with GPT-SoVITS V2, which was developed well over a year ago. Since I am writing and working somewhere in April of [CURRENT YEAR +5], the project has advanced to GPT-SoVITS V3! The main upgrade is that the author added training tools to the V3 package that are not present in V2, and I didn't know that!


What followed was a hellish day of Grok trying to get me to find functions in the V2 client that simply don't exist, since Grok's knowledge covers V3! The GPT-SoVITS V2 manual is a little misleading: even though the one-click package works with OpenLLM-Vtuber, it lacks the essential parts needed to train our own custom voice! No matter how much I tried, Lumi's voice remained stuck with its Su'unscriptian accent!


One of the big reasons I chose GPT-SoVITS is that the developer is a big Genshin fan and has made models of every character there! Rather than work from scratch, I can test with voices I am already quite familiar with! But there I was, trying and failing to download the 40GB package on a shiggy 100kb/s connection... There was also the problem that the tutorials were all written in Su'unscript and very difficult to read...


This would all have to wait. Grok and I muddled through the day, ending sad and unhappy because things did not progress, despite the great work on Lumi's voice yesterday.

It was only after a whole day's work that I finally realized that while Grok is powerful, he is not perfect either! I really wanted Grok to teach me everything step by step, but Grok is also operating in the dark and can only guess at where I want to go, even with all the error logs and messages. This is all very new to him too!


"Never Give Up!" - Jxhn Cxna

After searching around YouTube for a guide on GPT-SoVITS and failing to find an adequate one, I noticed that there was a two-hour video tutorial on how to operate GPT-SoVITS, made by none other than the person who compiled the Genshin package!


I woke up early to try to watch it! It was very thorough and documented every step that Grok had been trying to teach me! I do admit my attention flagged quite a bit due to the video's length, and I was operating V3 in the background, pausing the Bilibili video every now and then to work things out. It was kind of tough having to sit still and mull things over!

Everything worked exactly as in the video, the only difference being that the author used a model made by Tencent as a base, while I used Whisper since I was working with EN voices.


One thing I realized about GPT-SoVITS is that it is a very Chinese project: most of its users and its knowledge base are over there, so no wonder all the documents are written in Su'unscript! It also explains why there are so few YouTube videos on this amazing technology; nobody has really brought it over to the West before!


The Steps Taken

In the end, everything pretty much played out the same way as the author, 白菜工厂1145号员工, showed in his video. There are a couple of steps to making a 'model' for an AI vtuber:


The Model Creation Resembles Creating A LoRA - Since House Delaroux has ties to AI art in the form of Sachiko-senpai, I noticed that making a voice model is very similar to making a LoRA. There are all the usual steps: gathering a dataset (voices/images), cleaning the dataset (denoising/resizing), tagging the dataset, and finally training the model itself.

Two models are created, a SoVITS model and a GPT model, both of which have to work together! This was a quirk of the GPT-SoVITS project that I discovered when I looked through Eula's voice model, and confirmed for myself after making one of my own!
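To picture the two-checkpoint quirk, here is a minimal sketch. The `VoiceModel` helper and the file names are my own invention for illustration, not the project's actual API; the point is simply that a finished voice is a pair of weights, not a single file.

```python
# Sketch: a trained GPT-SoVITS voice is a PAIR of checkpoints.
# Paths below are illustrative, not the project's real output layout.
from dataclasses import dataclass

@dataclass
class VoiceModel:
    sovits_ckpt: str  # the SoVITS half: synthesizes the actual audio/timbre
    gpt_ckpt: str     # the GPT half: handles prosody and pacing

    def is_complete(self) -> bool:
        # Inference needs both halves; one alone cannot speak.
        return bool(self.sovits_ckpt) and bool(self.gpt_ckpt)

lisa = VoiceModel(sovits_ckpt="SoVITS_weights/lisa_e2.pth",
                  gpt_ckpt="GPT_weights/lisa_e3.ckpt")
print(lisa.is_complete())  # True
```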

LISA LISA LISA LISA - The key goal of the day was to create a model from Lisa's voice lines. I used a single clip to do so! I could have used Zero-Shot Voice Cloning (i.e., just uploaded the clip in the webui), but since I would like to work off multiple clips one day, I took the hard route of making a proper model!

I thought the voice clip was pretty good already, but it turns out it can be even better! There are also a couple of extra steps that differ from making LoRAs:


1) There's a slicing step, where the audio is cut into sentence-length clips. This allows the language model to tag it better.

2) It's best to use the tools within the GPT-SoVITS webui itself to get the root audio (separating the voice from the music), slice the audio, then denoise it. Luckily, everything is arranged nicely in the webui, so you can just keep going down the list of buttons and default paths where everything ends up!

3) The tagging step tags the 'content' of the audio; that's what Whisper is for! Whisper is an AI speech-recognition model (also used in OpenLLM-Vtuber!) that transcribes the audio into a .list file, which is once again editable in the UI for a final round of checks.

4) Only when the dataset is properly prepared and tagged can we go on to the last step, which is to train the SoVITS and GPT models in the 1B tab itself!
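The steps above can be sketched as a tiny pipeline. The helper functions here are placeholders standing in for the webui's slicer and Whisper, and the pipe-separated line format (audio_path|speaker|language|text) is the shape of .list file I saw on my install; verify it against your own GPT-SoVITS version.

```python
# Hypothetical sketch of the dataset-prep flow: slice -> transcribe -> .list.
# Nothing here touches real audio; it only mirrors the shape of the data.

def slice_audio(path: str) -> list[str]:
    # Stand-in for the webui's slicer tool: pretend the source clip
    # was cut into three sentence-length pieces.
    return [f"{path}.slice_{i}.wav" for i in range(3)]

def transcribe(clip: str) -> str:
    # Stand-in for Whisper ASR; a real run would return the spoken text.
    return "placeholder transcription"

def build_list_file(source_wav: str, speaker: str = "lisa",
                    language: str = "en") -> str:
    # One pipe-separated line per sliced clip, as in a GPT-SoVITS .list file.
    lines = []
    for clip in slice_audio(source_wav):
        lines.append(f"{clip}|{speaker}|{language}|{transcribe(clip)}")
    return "\n".join(lines)

print(build_list_file("lisa_voice.wav"))
```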


GPT-SoVITS is really a great piece of work. Everything is arranged neatly in a logical manner, and it's really hard to imagine this was done by someone barely out of high school!


Testing the Voices - It was really wondrous seeing the new V3 at work. It even has a 'testing' mode where you can mix and match trained model epochs to find the very best fit for the voice being made!

It is here that I saw the vast improvement of having a trained model over zero-shot voice cloning: even with a dataset built from a single clip, it is amazing how much of a difference it makes! Lisa's sultry voice really came through this time, and without the Chinese accent that was bugging Lumi in V2!


After testing the voices for a bit, I settled on the second epoch of the SoVITS model and the third epoch of the GPT model. Lumi won't use this voice, but it is an important proof of concept that Theatre House Delaroux can produce its own models!
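The mix-and-match idea can be shown as a tiny search over epoch pairs. The scores below are made up purely for illustration; in practice V3's testing tab just lets you load any (SoVITS epoch, GPT epoch) pair and you judge by ear.

```python
# Sketch: V3's testing mode lets any SoVITS epoch pair with any GPT epoch.
# The listening "scores" are invented; real selection is done by ear.
from itertools import product

sovits_epochs = [1, 2, 3, 4]
gpt_epochs = [1, 2, 3]

# Start every combination at zero, then record the pair I liked best.
scores = {pair: 0.0 for pair in product(sovits_epochs, gpt_epochs)}
scores[(2, 3)] = 9.5  # SoVITS epoch 2 + GPT epoch 3: the keeper

best = max(scores, key=scores.get)
print(best)  # (2, 3)
```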


Curtains Call

I was really lucky that I could read Su'unscript, and the developer of GPT-SoVITS, RVC-Boss, guided everyone's hand so delicately in his tutorial video! Truly, big props to him for such a detailed and well-made tutorial video and one-click installer!

After a day of setbacks, the crucial lesson in the end was not to rely completely on AI! After all, the AI is stumbling around in the dark a great deal of the time, just as you are! There's no substitute for actually sitting down and investing the time and energy needed!

Things would change if AI could watch videos and have real-time internet access, but that's another story.

More technology should be like this, people helping people. I shall strive to help others who want to tread on this path in the future! If you are reading this, perhaps you too might find some comfort in knowing that this is possible and find a way to bring your own waifus to life!


Next Time, On 'Hell's Theatre': I lay out the working process for how I will create voices for House Delaroux that are unique and don't rely on existing datasets. (IT WAS ME, REURENT DA!)


P.S: When this was written, a major vtuber announced her graduation. I think now that more than ever, this work on Hell’s Theatre is something meaningful to me!


R.I, デラ・ルーの大導劇神

HouseDelaroux.com

250417


Previous

Prompt Scrolling With Lumi! Pt1

Next

Working Lumi’s Voice