Wednesday, August 7, 2024

Offline Real-time Natural Sounding Speech Synthesis

 


The present embedded solution does not sound very natural, and it is hard to add support for new languages, accents, and voices. Neural-network (NN) based solutions address these issues, but they are not real-time on low-footprint devices and have huge models that require more memory.

An NN-based solution usually contains two parts (a minimal sketch of the data flow follows the list):

1.       Mel spectrogram generator

Generates a mel spectrogram (perceptually motivated acoustic features, modeled on how human hearing processes speech) from the input text.

2.       Vocoder

A speech synthesizer that generates the speech waveform from the mel spectrogram.
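
To make the data flow concrete, here is a minimal C++ sketch of the two-stage structure. The class names and dummy bodies are purely illustrative, not our implementation; in our case stage 1 is Tacotron2 or FastSpeech and stage 2 is LPCNet or SqueezeWave.

#include <cstdint>
#include <string>
#include <vector>

// One mel frame: a vector of filterbank energies. The count is model-dependent
// (e.g. 80 bands for a typical FastSpeech, or 20 features for LPCNet).
using MelFrame = std::vector<float>;

// Stage 1: text -> mel spectrogram (stand-in for Tacotron2 / FastSpeech).
struct MelGenerator {
    std::vector<MelFrame> generate(const std::string& text) {
        return std::vector<MelFrame>(text.size(), MelFrame(80, 0.0f));  // dummy frames
    }
};

// Stage 2: mel spectrogram -> 16-bit PCM waveform (stand-in for LPCNet / SqueezeWave).
struct Vocoder {
    std::vector<int16_t> synthesize(const std::vector<MelFrame>& mels) {
        return std::vector<int16_t>(mels.size() * 160, 0);  // dummy silence
    }
};

// The whole TTS pipeline is just the composition of the two stages.
std::vector<int16_t> tts(MelGenerator& gen, Vocoder& voc, const std::string& text) {
    return voc.synthesize(gen.generate(text));
}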

Based on our real-time and low-footprint requirements, we are exploring the following combinations of mel spectrogram generator and vocoder.

Solutions

1.       Tacotron2 + LPCNet

2.       FastSpeech + SqueezeWave

3.       FastSpeech + LPCNet


1.      Tacotron2 + LPCNet

Tacotron2 is an end-to-end TTS model that originally pairs with a WaveNet vocoder. To make it real-time, we made the following optimizations (sketches of (c) and (d) follow this list):

a)       Replaced the WaveNet vocoder with LPCNet

b)      Retrained Tacotron2 to output only 20 mel features (the reduced feature set LPCNet consumes)

c)       Memory-mapped the model weights

d)      Streamed the LPCNet output
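
For optimization (c), the idea is to store the network weights in a flat binary file and mmap() it read-only, so the OS pages the weights in on demand and can share them across processes instead of copying them onto the heap. Below is a minimal POSIX sketch; the file name and the flat float32 layout are assumptions for illustration, not our actual model format:

#include <cstdio>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main() {
    const char* path = "tacotron2_weights.bin";  // hypothetical flat weight file
    int fd = open(path, O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) != 0) { perror("fstat"); return 1; }
    if (st.st_size < (off_t)sizeof(float)) { std::fprintf(stderr, "empty model file\n"); return 1; }

    // Read-only mapping: pages are faulted in lazily and never copied to the heap,
    // so resident memory stays proportional to what inference actually touches.
    void* base = mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (base == MAP_FAILED) { perror("mmap"); return 1; }
    close(fd);  // the mapping remains valid after the descriptor is closed

    const float* weights = static_cast<const float*>(base);
    std::printf("mapped %lld bytes, first weight = %f\n",
                (long long)st.st_size, weights[0]);

    munmap(base, st.st_size);
    return 0;
}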

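To see why optimization (d) helps latency: instead of vocoding the whole utterance and then playing it, each 10 ms frame is handed to the audio sink as soon as it is produced. The sketch below assumes the C API of the open-source LPCNet library (lpcnet_create / lpcnet_synthesize / lpcnet_destroy); the frame size and feature count here are assumptions, so check lpcnet.h for the real constants:

#include <cstdio>
#include <vector>

extern "C" {
#include "lpcnet.h"  // open-source LPCNet vocoder
}

// Assumed values for illustration; the authoritative constants are in lpcnet.h.
constexpr int kFrameSize = 160;        // samples per 10 ms frame at 16 kHz
constexpr int kFeaturesPerFrame = 38;  // conditioning features per frame

// Stand-in audio sink: write raw PCM to stdout (on Android this would be
// an AudioTrack or AAudio stream instead).
static void play(const short* pcm, int n) {
    std::fwrite(pcm, sizeof(short), n, stdout);
}

void stream_synthesis(const std::vector<float>& features, int num_frames) {
    LPCNetState* st = lpcnet_create();
    std::vector<short> pcm(kFrameSize);
    for (int f = 0; f < num_frames; ++f) {
        // One frame in, one frame out: playback can start ~10 ms into
        // synthesis instead of after the whole utterance is done.
        lpcnet_synthesize(st, &features[f * kFeaturesPerFrame], pcm.data(), kFrameSize);
        play(pcm.data(), kFrameSize);
    }
    lpcnet_destroy(st);
}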

A prototype app running on Android produces output quite fast. It is still not real-time, but it is acceptable for short utterances. In addition, more training is required for a more natural sound.

 

On a Qualcomm 820 board:

Audio Length | Mel Spectrogram Gen | Vocoder | Total Time
12 s         | 15 s                | 10 s    | 25 s
3 s          | 2 s                 | 1.5 s   | 3.5 s

 

On x86, single core:

Audio Length | Mel Spectrogram Gen | Vocoder | Total Time
12 s         | 6 s                 | 4 s     | 10 s
3 s          |                     |         |


2.      FastSpeech + SqueezeWave

 

FastSpeech is an astonishingly fast mel spectrogram generator.



Combining it with the SqueezeWave vocoder on a single x86 CPU core, we attained the following timings, comfortably faster than real time:

On x86, single core:

Audio Length | Mel Spectrogram Gen | Vocoder | Total Time
12 s         | 1.2 s               | 2.7 s   | 3.9 s
3 s          | 0.5 s               | 0.8 s   | 1.3 s

 

However, both implementations are in PyTorch (Python), and converting the models so that they can be inferenced from C/C++ code is somewhat difficult. There are two options:

 

1.       Convert the model to TorchScript and run inference from C/C++ via LibTorch (see the sketch after this list).

2.       Convert both FastSpeech and SqueezeWave to a TensorFlow implementation, then use the model from C/C++ code.
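
For option 1, the model is exported once in Python (e.g. torch.jit.trace(model, example_input).save("fastspeech_ts.pt")) and then loaded from C++ through LibTorch. A minimal sketch of the C++ side follows; the file name, the input shape, and the assumption that forward() returns a single tensor are all illustrative:

#include <torch/script.h>  // LibTorch, the C++ distribution of PyTorch
#include <iostream>
#include <vector>

int main() {
    // Load the TorchScript module exported from Python.
    torch::jit::script::Module mel_gen = torch::jit::load("fastspeech_ts.pt");
    mel_gen.eval();

    // Dummy phoneme-ID sequence; the real shape and vocabulary depend on the model.
    std::vector<torch::jit::IValue> inputs;
    inputs.push_back(torch::ones({1, 50}, torch::kLong));

    // Run the mel generator; the resulting tensor would feed the vocoder next.
    at::Tensor mel = mel_gen.forward(inputs).toTensor();
    std::cout << "mel spectrogram shape: " << mel.sizes() << std::endl;
    return 0;
}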

 

 

3.      FastSpeech + LPCNet

 

Because of the complexity of converting the model from PyTorch to TorchScript, and because of LPCNet's proven performance, we are trying to integrate FastSpeech mel spectrogram generation with the LPCNet vocoder.

Since the LPCNet vocoder works with only 20 features rather than the full mel spectrogram FastSpeech normally produces, we need to modify FastSpeech's output layer and retrain the model before it can be used with LPCNet.

