Wednesday, August 7, 2024

Offline Real-time Natural Sounding Speech Synthesis

 


The present embedded solution does not sound very natural, and it is hard to add support for new languages, accents, and voices. Neural-network (NN) based solutions address these issues, but they are not real-time on low-footprint devices and have huge models that require more memory.

An NN-based solution usually contains two parts (a minimal sketch of the data flow follows the list):

1.       Mel spectrogram generator

Generates a mel spectrogram (perceptually motivated acoustic features, modeled on how human hearing processes speech) from the input text.

2.       Vocoder

A speech synthesizer that generates the speech waveform from the mel spectrogram.
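
To make the data flow concrete, here is a minimal C++ sketch of the two-stage structure. The class names and dummy bodies are purely illustrative, not our implementation; in our case stage 1 is Tacotron2 or FastSpeech and stage 2 is LPCNet or SqueezeWave.

#include <cstdint>
#include <string>
#include <vector>

// One mel frame: a vector of filterbank energies. The count is model-dependent
// (e.g. 80 bands for a typical FastSpeech, or 20 features for LPCNet).
using MelFrame = std::vector<float>;

// Stage 1: text -> mel spectrogram (stand-in for Tacotron2 / FastSpeech).
struct MelGenerator {
    std::vector<MelFrame> generate(const std::string& text) {
        return std::vector<MelFrame>(text.size(), MelFrame(80, 0.0f));  // dummy frames
    }
};

// Stage 2: mel spectrogram -> 16-bit PCM waveform (stand-in for LPCNet / SqueezeWave).
struct Vocoder {
    std::vector<int16_t> synthesize(const std::vector<MelFrame>& mels) {
        return std::vector<int16_t>(mels.size() * 160, 0);  // dummy silence
    }
};

// The whole TTS pipeline is just the composition of the two stages.
std::vector<int16_t> tts(MelGenerator& gen, Vocoder& voc, const std::string& text) {
    return voc.synthesize(gen.generate(text));
}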

Based on our real-time and low-footprint requirements, we are exploring the following combinations of mel spectrogram generator and vocoder.

Solutions

1.       Tacotron2 + LPCNet

2.       FastSpeech + SqueezeWave

3.       FastSpeech + LPCNet


1.      Tacotron2 + LPCNet

Tacotron2 is an end-to-end TTS model that originally pairs with a WaveNet vocoder. To make it real-time, we made the following optimizations (sketches of (c) and (d) follow this list):

a)       Replaced the WaveNet vocoder with LPCNet

b)      Retrained Tacotron2 to output only 20 mel features (the reduced feature set LPCNet consumes)

c)       Memory-mapped the model weights

d)      Streamed the LPCNet output
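
For optimization (c), the idea is to store the network weights in a flat binary file and mmap() it read-only, so the OS pages the weights in on demand and can share them across processes instead of copying them onto the heap. Below is a minimal POSIX sketch; the file name and the flat float32 layout are assumptions for illustration, not our actual model format:

#include <cstdio>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main() {
    const char* path = "tacotron2_weights.bin";  // hypothetical flat weight file
    int fd = open(path, O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) != 0) { perror("fstat"); return 1; }
    if (st.st_size < (off_t)sizeof(float)) { std::fprintf(stderr, "empty model file\n"); return 1; }

    // Read-only mapping: pages are faulted in lazily and never copied to the heap,
    // so resident memory stays proportional to what inference actually touches.
    void* base = mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (base == MAP_FAILED) { perror("mmap"); return 1; }
    close(fd);  // the mapping remains valid after the descriptor is closed

    const float* weights = static_cast<const float*>(base);
    std::printf("mapped %lld bytes, first weight = %f\n",
                (long long)st.st_size, weights[0]);

    munmap(base, st.st_size);
    return 0;
}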

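To see why optimization (d) helps latency: instead of vocoding the whole utterance and then playing it, each 10 ms frame is handed to the audio sink as soon as it is produced. The sketch below assumes the C API of the open-source LPCNet library (lpcnet_create / lpcnet_synthesize / lpcnet_destroy); the frame size and feature count here are assumptions, so check lpcnet.h for the real constants:

#include <cstdio>
#include <vector>

extern "C" {
#include "lpcnet.h"  // open-source LPCNet vocoder
}

// Assumed values for illustration; the authoritative constants are in lpcnet.h.
constexpr int kFrameSize = 160;        // samples per 10 ms frame at 16 kHz
constexpr int kFeaturesPerFrame = 38;  // conditioning features per frame

// Stand-in audio sink: write raw PCM to stdout (on Android this would be
// an AudioTrack or AAudio stream instead).
static void play(const short* pcm, int n) {
    std::fwrite(pcm, sizeof(short), n, stdout);
}

void stream_synthesis(const std::vector<float>& features, int num_frames) {
    LPCNetState* st = lpcnet_create();
    std::vector<short> pcm(kFrameSize);
    for (int f = 0; f < num_frames; ++f) {
        // One frame in, one frame out: playback can start ~10 ms into
        // synthesis instead of after the whole utterance is done.
        lpcnet_synthesize(st, &features[f * kFeaturesPerFrame], pcm.data(), kFrameSize);
        play(pcm.data(), kFrameSize);
    }
    lpcnet_destroy(st);
}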

A prototype app running on Android produces output quite fast. It is still not real-time, but it is acceptable for short utterances. In addition, more training is required for a more natural sound.

 

On a Qualcomm 820 board:

Audio Length | Mel Spectrogram Gen | Vocoder | Total Time
12 s         | 15 s                | 10 s    | 25 s
3 s          | 2 s                 | 1.5 s   | 3.5 s

 

On x86, single core:

Audio Length | Mel Spectrogram Gen | Vocoder | Total Time
12 s         | 6 s                 | 4 s     | 10 s
3 s          |                     |         |


2.      FastSpeech + SqueezeWave

 

FastSpeech is an astonishingly fast mel spectrogram generator.



Combining it with the SqueezeWave vocoder on a single x86 CPU core, we attained the following timings, comfortably faster than real time:

On x86, single core:

Audio Length | Mel Spectrogram Gen | Vocoder | Total Time
12 s         | 1.2 s               | 2.7 s   | 3.9 s
3 s          | 0.5 s               | 0.8 s   | 1.3 s

 

However, both implementations are in PyTorch (Python), and converting the models so that they can be inferenced from C/C++ code is somewhat difficult. There are two options:

 

1.       Convert the model to TorchScript and run inference from C/C++ via LibTorch (see the sketch after this list).

2.       Convert both FastSpeech and SqueezeWave to a TensorFlow implementation, then use the model from C/C++ code.
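
For option 1, the model is exported once in Python (e.g. torch.jit.trace(model, example_input).save("fastspeech_ts.pt")) and then loaded from C++ through LibTorch. A minimal sketch of the C++ side follows; the file name, the input shape, and the assumption that forward() returns a single tensor are all illustrative:

#include <torch/script.h>  // LibTorch, the C++ distribution of PyTorch
#include <iostream>
#include <vector>

int main() {
    // Load the TorchScript module exported from Python.
    torch::jit::script::Module mel_gen = torch::jit::load("fastspeech_ts.pt");
    mel_gen.eval();

    // Dummy phoneme-ID sequence; the real shape and vocabulary depend on the model.
    std::vector<torch::jit::IValue> inputs;
    inputs.push_back(torch::ones({1, 50}, torch::kLong));

    // Run the mel generator; the resulting tensor would feed the vocoder next.
    at::Tensor mel = mel_gen.forward(inputs).toTensor();
    std::cout << "mel spectrogram shape: " << mel.sizes() << std::endl;
    return 0;
}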

 

 

3.      FastSpeech + LPCNet

 

Because of the complexity of converting the model from PyTorch to TorchScript, and because of LPCNet's proven performance, we are trying to integrate FastSpeech mel spectrogram generation with the LPCNet vocoder.

Since the LPCNet vocoder works with only 20 features rather than the full mel spectrogram FastSpeech normally produces, we need to modify FastSpeech's output layer and retrain the model before it can be used with LPCNet.

