PAVITS: Exploring Prosody-aware VITS for End-to-End Emotional Voice Conversion
1. Abstract
In this paper, we propose Prosody-aware VITS (PAVITS) for emotional voice conversion (EVC), aiming to achieve the two major objectives of EVC: high content naturalness and high emotional naturalness, both of which are crucial for meeting the demands of human perception. To improve the content naturalness of converted audio, we develop an end-to-end EVC architecture inspired by the high audio quality of VITS. By seamlessly integrating an acoustic converter and a vocoder, we address the mismatch between emotional prosodic training and run-time conversion that is common in existing EVC models. To further enhance emotional naturalness, we introduce an emotion descriptor to model the subtle prosodic variations of different speech emotions. Additionally, we propose a prosody predictor, which predicts prosody features from text based on the provided emotion label. Notably, we introduce a prosody alignment loss to connect the latent prosody features from the two distinct modalities, ensuring effective training. Experimental results show that PAVITS outperforms state-of-the-art EVC methods.
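The abstract does not state the form of the prosody alignment loss. As a minimal sketch, assume each modality yields a per-frame diagonal Gaussian over latent prosody features (a mean and a log-variance vector), and that the loss is the KL divergence between the two, averaged over frames. KL terms of this kind are used in VITS-style models, but their use here is an assumption, and the names `gaussian_kl` and `prosody_alignment_loss` are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

# One frame of latent prosody statistics: (mean vector, log-variance vector).
Frame = Tuple[Sequence[float], Sequence[float]]


def gaussian_kl(mu_p: Sequence[float], logvar_p: Sequence[float],
                mu_q: Sequence[float], logvar_q: Sequence[float]) -> float:
    """KL(p || q) between two diagonal Gaussians, summed over dimensions."""
    if not (len(mu_p) == len(logvar_p) == len(mu_q) == len(logvar_q)):
        raise ValueError("all inputs must have the same length")
    total = 0.0
    for mp, lp, mq, lq in zip(mu_p, logvar_p, mu_q, logvar_q):
        # Closed form per dimension:
        # 0.5 * (log(var_q / var_p) + (var_p + (mu_p - mu_q)^2) / var_q - 1)
        total += 0.5 * (lq - lp + (math.exp(lp) + (mp - mq) ** 2) / math.exp(lq) - 1.0)
    return total


def prosody_alignment_loss(frames_a: List[Frame], frames_b: List[Frame]) -> float:
    """Assumed alignment loss: mean over frames of KL(modality A || modality B)."""
    if not frames_a or len(frames_a) != len(frames_b):
        raise ValueError("both inputs need the same, non-zero number of frames")
    kls = [gaussian_kl(mu_a, lv_a, mu_b, lv_b)
           for (mu_a, lv_a), (mu_b, lv_b) in zip(frames_a, frames_b)]
    return sum(kls) / len(kls)


if __name__ == "__main__":
    # Toy statistics for two frames with two latent dimensions each.
    from_emotion_descriptor = [([0.0, 0.0], [0.0, 0.0]), ([1.0, -1.0], [0.0, 0.0])]
    from_prosody_predictor = [([0.0, 0.0], [0.0, 0.0]), ([0.0, 0.0], [0.0, 0.0])]
    print(prosody_alignment_loss(from_emotion_descriptor, from_prosody_predictor))
```

The loss is zero when the two sets of statistics coincide and grows as they diverge, which is the property an alignment term needs in order to pull the text-side predictions toward the features extracted by the emotion descriptor.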
2. Experimental Results
Text | Source | CycleGAN | StarGAN | VITS (VL approach shown here) | PAVITS-FL (proposed) | PAVITS-VL (proposed) | Target |
打远一看,它们的确很是美丽。(English: At a distance, they indeed appear very beautiful.) | | | | | | | |
我老家在北京。(English: My hometown is in Beijing.) | | | | | | | |
妇女节快乐,我永远爱你,妈妈。(English: Happy Women's Day, I will always love you, Mom.) | | | | | | | |
周末的我,只忙着陪你。(English: On weekends, I'm busy only with keeping you company.) | | | | | | | |
谁你也不认识,我很乐意帮助你。(English: You don't know anybody, but I'm more than willing to help you.) | | | | | | | |
不就是你嘛,为什么要偷笑来。(English: Isn't it just you? Why are you sneaking a laugh?) | | | | | | | |
打远一看,它们的确很是美丽。(English: At a distance, they indeed appear very beautiful.) | | | | | | | |
我老家在北京。(English: My hometown is in Beijing.) | | | | | | | |
妇女节快乐,我永远爱你,妈妈。(English: Happy Women's Day, I will always love you, Mom.) | | | | | | | |
周末的我,只忙着陪你。(English: On weekends, I'm busy only with keeping you company.) | | | | | | | |
英国的哲学家曾经说过。(English: The British philosopher once said.) | | | | | | | |
很快你上大学就用得到了。(English: You will need it as soon as you enter university.) | | | | | | | |
打远一看,它们的确很是美丽。(English: At a distance, they indeed appear very beautiful.) | | | | | | | |
我老家在北京。(English: My hometown is in Beijing.) | | | | | | | |
妇女节快乐,我永远爱你,妈妈。(English: Happy Women's Day, I will always love you, Mom.) | | | | | | | |
周末的我,只忙着陪你。(English: On weekends, I'm busy only with keeping you company.) | | | | | | | |
沙尘暴好像给每个人都带来了麻烦!(English: The sandstorm seems to have brought trouble to everyone!) | | | | | | | |
他一定是一眼就被你迷住了。(English: He must have been captivated by you at first sight.) | | | | | | | |
打远一看,它们的确很是美丽。(English: At a distance, they indeed appear very beautiful.) | | | | | | | |
我老家在北京。(English: My hometown is in Beijing.) | | | | | | | |
妇女节快乐,我永远爱你,妈妈。(English: Happy Women's Day, I will always love you, Mom.) | | | | | | | |
周末的我,只忙着陪你。(English: On weekends, I'm busy only with keeping you company.) | | | | | | | |
英国的哲学家曾经说过。(English: The British philosopher once said.) | | | | | | | |
谁你也不认识,我很乐意帮助你。(English: You don't know anybody, but I'm more than willing to help you.) | | | | | | | |
Visualization of converted spectrograms.
To further show the effectiveness of our method, we visualize the spectrograms of test clips (Neutral-to-Happy shown here). From top to bottom are the ground-truth spectrogram, the spectrogram converted by the original VITS method, and the spectrogram converted by the proposed PAVITS method. The spectrogram converted by PAVITS exhibits finer detail in prosodic variation within the pertinent frequency bands while preserving descriptive information in the other frequency bands.
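For readers who want to produce a comparable view of their own clips, the following is a minimal sketch of a log-magnitude spectrogram computed with a short-time Fourier transform. The frame length, hop size, and Hann window are illustrative assumptions, not the settings used for the figures here, and the function name `log_spectrogram` is hypothetical. It uses a naive DFT from the standard library to stay self-contained; an FFT library would normally be used on real audio.

```python
import cmath
import math
from typing import List, Sequence


def log_spectrogram(signal: Sequence[float], frame_len: int = 64,
                    hop: int = 32, eps: float = 1e-10) -> List[List[float]]:
    """Return log-magnitude STFT frames, indexed as [time frame][frequency bin]."""
    # Symmetric Hann window to reduce spectral leakage at frame edges.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / (frame_len - 1))
              for n in range(frame_len)]
    n_bins = frame_len // 2 + 1  # non-negative frequencies of a real signal
    frames: List[List[float]] = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = [signal[start + n] * window[n] for n in range(frame_len)]
        row = []
        for k in range(n_bins):
            # Naive DFT coefficient for bin k (O(frame_len) per bin).
            coeff = sum(x * cmath.exp(-2j * math.pi * k * n / frame_len)
                        for n, x in enumerate(frame))
            row.append(math.log(abs(coeff) + eps))
        frames.append(row)
    return frames


if __name__ == "__main__":
    # Toy input: a 200 Hz tone sampled at 1600 Hz (assumed values).
    sample_rate = 1600
    tone = [math.sin(2 * math.pi * 200 * n / sample_rate) for n in range(256)]
    spec = log_spectrogram(tone)
    peak_bin = max(range(len(spec[0])), key=lambda k: spec[0][k])
    print(len(spec), len(spec[0]), peak_bin * sample_rate / 64)
```

Stacking the ground-truth, VITS, and PAVITS spectrograms of the same utterance on a shared time axis then gives the top-to-bottom comparison described above.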