Offline Real-Time, Natural-Sounding Speech Synthesis
The present embedded solution does not sound very natural, and supporting new languages, accents, and voices is hard. Neural-network-based solutions address these issues, but they are not real-time on low-footprint devices and have huge models that require more memory.
An NN-based solution usually consists of two parts:

1. Mel spectrogram generator: generates a mel spectrogram (important acoustic features modeled on how the human brain processes speech) from text.
2. Vocoder: a speech synthesizer that generates the speech waveform from the mel spectrogram.
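As a minimal sketch of this two-stage pipeline (the text_to_mel and mel_to_audio callables are hypothetical placeholders for the models discussed below):

```python
import numpy as np

def synthesize(text, text_to_mel, mel_to_audio, sample_rate=16000):
    """Two-stage neural TTS: text -> mel spectrogram -> waveform."""
    # Stage 1: the acoustic model (e.g. Tacotron2 or FastSpeech)
    # predicts a mel spectrogram of shape (n_frames, n_mels).
    mel = text_to_mel(text)
    # Stage 2: the vocoder (e.g. LPCNet or SqueezeWave) converts
    # the mel spectrogram into PCM audio samples.
    audio = mel_to_audio(mel)
    return np.asarray(audio, dtype=np.int16), sample_rate
```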
Based on our real-time and low-footprint requirements, we are exploring the following mel spectrogram generator and vocoder combinations:

1. Tacotron2 + LPCNet
2. FastSpeech + SqueezeWave
3. FastSpeech + LPCNet
1. Tacotron2 + LPCNet

Tacotron2 is an end-to-end TTS system with a WaveNet vocoder. To make it real-time, we made the following optimizations (a sketch of (c) and (d) follows the list):

a) Replaced the WaveNet vocoder with LPCNet
b) Retrained Tacotron2 to output only 20 mel bands
c) Memory-mapped the model
d) Streamed the LPCNet output
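A minimal sketch of optimizations (c) and (d), assuming the weights are stored as a raw float32 file and that a hypothetical vocode_chunk callable wraps the LPCNet frame loop:

```python
import numpy as np

def load_weights_mmap(path, shape, dtype=np.float32):
    """Optimization (c): memory-map the weight file so the OS pages
    it in on demand instead of copying the whole model into RAM."""
    return np.memmap(path, dtype=dtype, mode="r", shape=shape)

def stream_vocoder(mel, vocode_chunk, frames_per_chunk=40):
    """Optimization (d): yield PCM chunks as soon as they are ready,
    so playback can begin before the full utterance is synthesized."""
    for start in range(0, mel.shape[0], frames_per_chunk):
        # vocode_chunk is a hypothetical per-chunk LPCNet call
        yield vocode_chunk(mel[start:start + frames_per_chunk])
```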
A prototype app running on the Android platform produces output quite fast; it is still not real-time, but the latency is acceptable for short utterances. In addition, more training is required for more natural-sounding output.
On a Qualcomm 820 board:

| Audio Sample | Mel Spectrogram Gen | Vocoder | Total Time |
|--------------|---------------------|---------|------------|
| 12 s         | 15 s                | 10 s    | 25 s       |
| 3 s          | 2 s                 | 1.5 s   | 3.5 s      |
On x86, single core:

| Audio Sample | Mel Spectrogram Gen | Vocoder | Total Time |
|--------------|---------------------|---------|------------|
| 12 s         | 6 s                 | 4 s     | 10 s       |
| 3 s          |                     |         |            |
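To put these tables in perspective: the real-time factor (RTF) is synthesis time divided by audio duration, and values below 1.0 mean faster than real time. A small helper using the board numbers above:

```python
def rtf(synthesis_seconds, audio_seconds):
    """Real-time factor: values below 1.0 are faster than real time."""
    return synthesis_seconds / audio_seconds

# Tacotron2 + LPCNet on the Qualcomm 820 board (tables above)
print(rtf(25.0, 12.0))  # ~2.08: well above real time
print(rtf(3.5, 3.0))    # ~1.17: close enough for short utterances
```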
2. FastSpeech + SqueezeWave

FastSpeech is an astonishingly fast mel spectrogram generator. Combining it with the SqueezeWave vocoder on a single x86 CPU core, we obtained the following timings:
On x86, single core:

| Audio Sample | Mel Spectrogram Gen | Vocoder | Total Time |
|--------------|---------------------|---------|------------|
| 12 s         | 1.2 s               | 2.7 s   | 3.9 s      |
| 3 s          | 0.5 s               | 0.8 s   | 1.3 s      |
However, both of these implementations are in PyTorch (Python), and converting the models so they can be inferenced from C/C++ code is a little difficult. The options are:

1. Convert the models to TorchScript and run inference via C/C++ (see the sketch below).
2. Convert both FastSpeech and SqueezeWave to a TensorFlow implementation, then use the models from C/C++ code.
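Option 1 uses the standard TorchScript export path. A minimal sketch with a stand-in network (the real FastSpeech/SqueezeWave modules would take its place):

```python
import torch

class TinyNet(torch.nn.Module):
    """Stand-in for the actual FastSpeech/SqueezeWave network."""
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(80, 80)

    def forward(self, mel):
        return self.linear(mel)

model = TinyNet().eval()
example = torch.randn(1, 80)

# Trace the model into a TorchScript module and serialize it.
scripted = torch.jit.trace(model, example)
scripted.save("tiny_net.pt")
```

The saved file can then be loaded from C++ with libtorch's torch::jit::load. The difficulty in practice is that models with data-dependent control flow (such as FastSpeech's length regulator) generally need torch.jit.script rather than tracing.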
3. FastSpeech + LPCNet

Because of the complexity of converting the models from PyTorch to TorchScript, and because of LPCNet's proven performance, we are trying to integrate FastSpeech mel spectrogram generation with the LPCNet vocoder. Since the LPCNet vocoder only works with fewer mel bands, we need to modify FastSpeech and retrain the model for use with LPCNet.
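As an illustration of the feature-side change, here is a sketch of extracting a reduced-band mel spectrogram with librosa (the parameters are assumptions, and LPCNet's real feature extractor is its own C pipeline, so this is only illustrative):

```python
import numpy as np
import librosa

def extract_mel(wav_path, n_mels=20, sr=16000):
    """Extract a reduced mel spectrogram (20 bands rather than the
    usual 80) so the acoustic model's output matches the vocoder."""
    y, _ = librosa.load(wav_path, sr=sr)
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=1024, hop_length=256, n_mels=n_mels)
    return np.log(np.maximum(mel, 1e-5)).T  # shape: (frames, n_mels)
```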