In this subnet, we aim to gather the wisdom of the whole open-source community to create the best open-source TTS models.
(What does that mean?)
From my research, I am only able to find this product: https://app.myshell.ai/robot-workshop/widget/1788537048029802496. It requires a login, and I can't find any information on how to use the model outside of it, nor any usage statistics.
This seems like a big missed opportunity. TTS output is very easy to showcase, so I would advise the team to provide more accessible ways to interact with what the subnet produces.
Of course, the subnet itself produces model weights, not inference, so the models would need to be hosted by the team or by validators to be used.
On that subject, I don't see how validators themselves can derive value from this directly. It could provide a massive positive influence on Bittensor, but I think that is missing right now.
2/10
The repository readme features nice diagrams explaining the mechanism.
I would love to see more detail on the 'Speaker Alignment Rater' and the 'Pronunciation Rater', to see how they actually work in practice, and whether they are robust.
But on the face of it, it seems well thought out and designed.
There is also this file which gives more info.
All subnets should do this. Unfortunately, this file is old code copied and pasted from the old Nous subnet 6, and is very out of date, so subnet 3 should provide its own up-to-date version!
Initial dataset size
Very limited: just a fixed list of samples.
All miners need to do is heavily overfit these samples, fine-tuning the model to get exactly these samples right. I am surprised this still exists in the subnet some six months after launch.
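To illustrate why a fixed evaluation list is gameable, here is a toy sketch (my own illustration, not the subnet's actual code; all names and data are hypothetical): a "miner" that simply memorizes the fixed samples scores perfectly without learning anything general.

```python
# Toy illustration (not subnet code): with a fixed eval set,
# a lookup table that memorizes the expected outputs scores perfectly.
FIXED_EVAL = {"hello": "audio_hello", "goodbye": "audio_goodbye"}  # hypothetical

class MemorizingMiner:
    def __init__(self, eval_set):
        self.table = dict(eval_set)  # "overfit": just store the answers

    def synthesize(self, text):
        # Perfect on the fixed list, useless on anything new.
        return self.table.get(text, "garbage")

def score(miner, eval_set):
    hits = sum(miner.synthesize(t) == ref for t, ref in eval_set.items())
    return hits / len(eval_set)

miner = MemorizingMiner(FIXED_EVAL)
print(score(miner, FIXED_EVAL))        # 1.0 on the fixed samples
print(miner.synthesize("new prompt"))  # garbage off the fixed list
```

Rotating or expanding the evaluation set would break this strategy, which is why the fixed list is the core weakness here.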
Judgement
So, let's look at how scores are actually judged:
myshell-test/judge
MyShell uses their own custom judge model for part of the scoring, though it is not clear from the documentation what it actually evaluates.
Consensus alpha
The subnet uses a 'consensus alpha' to keep validators close to each other on vtrust. I'm not sure whether this is a good thing.
While it keeps validators high on vtrust, it goes against the philosophy of Bittensor: validators should form independent opinions, and the network should aggregate them. That is one of Bittensor's key principles. By blending your weights to match everyone else's, strong opinions are diluted.
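To make the concern concrete, here is a minimal sketch of what a 'consensus alpha' blend typically looks like (the function name and exact formula are my assumptions, not taken from the subnet's code): the validator's own scores contribute only a (1 - alpha) fraction, so a high alpha mostly erases independent judgement.

```python
# Hypothetical sketch of a 'consensus alpha' blend; names and formula
# are assumptions for illustration, not the subnet's actual code.
def blend_weights(own_weights, consensus_weights, alpha=0.9):
    """Pull a validator's weights toward the network consensus.

    With alpha near 1, the result mostly tracks consensus; the
    validator's independent opinion contributes only (1 - alpha).
    """
    return [
        alpha * c + (1 - alpha) * w
        for w, c in zip(own_weights, consensus_weights)
    ]

own = [0.8, 0.1, 0.1]        # a validator's independent scores
consensus = [0.3, 0.4, 0.3]  # network-average weights
print(blend_weights(own, consensus, alpha=0.9))
```

With alpha = 0.9, a validator that strongly prefers the first miner (0.8) ends up reporting roughly 0.35 for it: the blend keeps vtrust high precisely by flattening disagreement.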
5/10 from me. Worries about certain aspects, but there are a lot of measures that could make it robust.
The benefit would be more obvious if we could see the output.
Not amazing.
These are some examples of very 'researcher'-esque code, which is quite hard to read:
1.
2.
3.
There are a lot more. It's quite difficult for non-TTS experts to understand the validation mechanism, which is a bad quality for a subnet in my opinion.
I understand the measuring of miners must be specific to TTS, but the code should be clear for validators and miners to understand.
5/10
There is pretty much one miner dominating this subnet, likely uploading many duplicate models to claim all the rewards.
This makes sense in a way, as Bittensor is permissionless, so one participant will usually win.
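A toy sketch of why uploading duplicates pays (my own illustration under assumed reward rules, not the subnet's actual code): if the reward pool is split among the top-scoring entries, one miner holding several copies of the best model captures the whole pool.

```python
# Toy sketch (not subnet code): a winner-takes-most split where the
# pool is divided among all entries tied at the top score.
def split_rewards(scores, pool=1.0):
    top = max(scores.values())
    winners = [uid for uid, s in scores.items() if s == top]
    return {
        uid: (pool / len(winners) if uid in winners else 0.0)
        for uid in scores
    }

# One miner registers three copies of the same winning model.
scores = {
    "miner_a_copy1": 0.91,
    "miner_a_copy2": 0.91,
    "miner_a_copy3": 0.91,
    "miner_b": 0.90,
}
print(split_rewards(scores))  # all three copies share the entire pool
```

Under these assumed rules, duplicate uploads cost the copier nothing and shut everyone else out, which matches the dominance pattern described above.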
What's missing is any historical record of what has happened: who the winners have been, and how the top spot has changed hands. I can't see any indication from their Hugging Face of improvement over time, or any metrics on how competitive the subnet has been.
4/10
From the GitHub:
Roadmap
As building a TTS model is a complex task, we will divide the development into several phases.
Phase 1: Initial release of the subnet, including miner and validator functionality. This phase aims to build a comprehensive pipeline for TTS model training and evaluation. We will begin with a fixed speaker.
Phase 2: Increase the coverage of the speaker and conversation pool. We will recurrently update the speaker and language to cover more scenarios.
Phase 3: More generally, we can have fast-clone models that can be adapted to new speakers with a small amount of data, e.g., OpenVoice. We will move to fast-clone models in this phase.
It seems MyShell has progressed to phase 2 after around six months of development.
Having said that, speaker updates have slowed down massively, and there is less and less active development work.
I think the space this subnet is in has immense potential, but we need some renewed enthusiasm to get there.
4/10
Good contributions to decentralized AI. However, we are only optimizing MeloTTS models at the current time, and only for a select number of speakers.
7.5/10
https://huggingface.co/spaces/myshell-test/tts-subnet-leaderboard
This is great, as it shows the leaderboard and allows people to see how they are doing, and what they need to do to compete.
I would like to see head-to-head comparisons or more information on how the scores are actually calculated, but it's a good start.
4/10
Quite a few worries here, with this subnet biased towards MyShell.
This puts MyShell at a distinct advantage. Coupled with the winner-takes-most mechanism, I would be worried about the decentralization of this subnet.
1/10
Most of the models are based on models developed by MyShell, with just a few added tweaks.
Given that the output is not used anywhere that I can see, I think there is little innovation happening here.
2/10
Very little engagement in the channel. Nearly all of my questions to the subnet 3 team went unanswered, even though the answers would have helped clarify the points above.
Responses are few and far between, with the subnet developer being quite inactive.
Speakers are changed less frequently than promised.
2/10