PAVITS: Exploring Prosody-aware VITS for End-to-End Emotional Voice Conversion

Tianhua Qi1,2, Wenming Zheng1,2*, Cheng Lu1,2, Yuan Zong1,2, Hailun Lian1
1 Key Laboratory of Child Development and Learning Science (Southeast University), Ministry of Education, Nanjing 210096, China
2 School of Biological Science and Medical Engineering, Southeast University, China

1. Abstract

In this paper, we propose Prosody-aware VITS (PAVITS) for emotional voice conversion (EVC), aiming to achieve two major objectives of EVC: high content naturalness and high emotional naturalness, both of which are crucial for meeting the demands of human perception. To improve the content naturalness of converted audio, we develop an end-to-end EVC architecture inspired by the high audio quality of VITS. By integrating the acoustic converter and the vocoder into a single model, we address the mismatch between emotional prosody training and run-time conversion that is common in existing EVC models. To further enhance emotional naturalness, we introduce an emotion descriptor that models the subtle prosodic variations of different speech emotions and a prosody predictor that predicts prosodic features from text based on the provided emotion label. In addition, we introduce a prosody alignment loss that connects the latent prosodic features from the two distinct modalities, ensuring effective training. Experimental results show that PAVITS outperforms state-of-the-art EVC methods.
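The abstract does not state the form of the prosody alignment loss. As a minimal sketch only, one plausible choice in a VITS-style model is a KL divergence between two diagonal Gaussians over latent prosodic features: one assumed to come from the speech side (emotion descriptor) and one from the text side (prosody predictor). The function name `gaussian_kl`, the Gaussian parameterization, and the tensor shapes below are all assumptions made for illustration, not the authors' implementation.

```python
import numpy as np

def gaussian_kl(mu_q, logvar_q, mu_p, logvar_p):
    """KL(q || p) between diagonal Gaussians, summed over feature
    dimensions and averaged over frames.

    All inputs have shape (frames, dims). Hypothetical stand-in for an
    alignment loss between speech-derived (q) and text-derived (p)
    latent prosodic features.
    """
    var_q = np.exp(logvar_q)
    var_p = np.exp(logvar_p)
    kl = 0.5 * (logvar_p - logvar_q + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0)
    return float(kl.sum(axis=-1).mean())

# Toy example: 50 frames, 4 latent dimensions, unit variance on both sides.
rng = np.random.default_rng(0)
mu_speech = rng.normal(size=(50, 4))
logvar = np.zeros((50, 4))

loss_same = gaussian_kl(mu_speech, logvar, mu_speech, logvar)           # identical latents
loss_shifted = gaussian_kl(mu_speech, logvar, mu_speech + 1.0, logvar)  # means offset by 1
```

Under this assumed form, the loss is zero when the two latent distributions coincide and grows as the text-side prediction drifts away from the speech-side features, which is the behavior an alignment term needs.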



2. Experimental Results

Source | CycleGAN | StarGAN | VITS (VL approach shown here) | PAVITS-FL (proposed) | PAVITS-VL (proposed) | Target | Text
Neutral-to-Angry
打远一看,它们的确很是美丽。(English: At a distance, they indeed appear very beautiful.)
我老家在北京。(English: My hometown is in Beijing.)
妇女节快乐,我永远爱你,妈妈。(English: Happy Women's Day, I will always love you, Mom.)
周末的我,只忙着陪你。(English: On weekends, the only thing I'm busy with is keeping you company.)
谁你也不认识,我很乐意帮助你。(English: You don't know anybody, but I'm more than willing to help you.)
不就是你嘛,为什么要偷笑来。(English: Isn't it just you? Why are you sneaking a laugh?)
Neutral-to-Happy
打远一看,它们的确很是美丽。(English: At a distance, they indeed appear very beautiful.)
我老家在北京。(English: My hometown is in Beijing.)
妇女节快乐,我永远爱你,妈妈。(English: Happy Women's Day, I will always love you, Mom.)
周末的我,只忙着陪你。(English: On weekends, the only thing I'm busy with is keeping you company.)
英国的哲学家曾经说过。(English: The British philosopher once said.)
很快你上大学就用得到了。(English: You will need it as soon as you enter university.)
Neutral-to-Sad
打远一看,它们的确很是美丽。(English: At a distance, they indeed appear very beautiful.)
我老家在北京。(English: My hometown is in Beijing.)
妇女节快乐,我永远爱你,妈妈。(English: Happy Women's Day, I will always love you, Mom.)
周末的我,只忙着陪你。(English: On weekends, the only thing I'm busy with is keeping you company.)
沙尘暴好像给每个人都带来了麻烦!(English: The sandstorm seems to have brought trouble to everyone!)
他一定是一眼就被你迷住了。(English: He must have been captivated by you at first sight.)
Neutral-to-Surprise
打远一看,它们的确很是美丽。(English: At a distance, they indeed appear very beautiful.)
我老家在北京。(English: My hometown is in Beijing.)
妇女节快乐,我永远爱你,妈妈。(English: Happy Women's Day, I will always love you, Mom.)
周末的我,只忙着陪你。(English: On weekends, the only thing I'm busy with is keeping you company.)
英国的哲学家曾经说过。(English: The British philosopher once said.)
谁你也不认识,我很乐意帮助你。(English: You don't know anybody, but I'm more than willing to help you.)

Visualization of converted spectrograms.

To further demonstrate the effectiveness of our method, we visualize the spectrograms of test clips (Neutral-to-Happy shown here). From top to bottom are the ground-truth spectrogram, the spectrogram converted by the original VITS method, and the spectrogram converted by the proposed PAVITS method. The spectrogram converted by PAVITS shows finer detail in the prosodic variations within the pertinent frequency bands while preserving the descriptive information in the other frequency bands.
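For readers who want to produce a comparable view from their own clips, the following is a minimal sketch of a log-magnitude spectrogram computed with NumPy alone. The function name `log_spectrogram`, the FFT size, the hop length, and the synthetic 440 Hz tone used as a stand-in for a real clip are assumptions for illustration; the page does not state which analysis settings were used for the figure.

```python
import numpy as np

def log_spectrogram(wave, n_fft=1024, hop=256, eps=1e-10):
    """Log-magnitude STFT in dB, shape (n_fft // 2 + 1, frames).

    Assumes `wave` is a 1-D float array with at least n_fft samples.
    """
    window = np.hanning(n_fft)
    n_frames = 1 + (len(wave) - n_fft) // hop
    # Slice the signal into overlapping, windowed frames.
    frames = np.stack(
        [wave[i * hop : i * hop + n_fft] * window for i in range(n_frames)]
    )
    magnitude = np.abs(np.fft.rfft(frames, axis=1))
    return 20.0 * np.log10(magnitude + eps).T

# Synthetic stand-in for a speech clip: one second of a 440 Hz tone at 16 kHz.
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
tone = np.sin(2 * np.pi * 440 * t)

spec = log_spectrogram(tone)
peak_bin = int(spec.mean(axis=1).argmax())   # frequency bin with the most energy
peak_hz = peak_bin * sample_rate / 1024      # bin centre frequency in Hz
```

Applying the same function to the ground-truth, VITS, and PAVITS clips and stacking the three arrays vertically (for example with matplotlib's `imshow`) would give a top-to-bottom layout like the one described above.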