Towards Realistic Emotional Voice Conversion using Controllable Emotional Intensity

Tianhua Qi1,2, Shiyan Wang1,2, Cheng Lu1,2, Yan Zhao1, Yuan Zong1,2, Wenming Zheng1,2*
1 Key Laboratory of Child Development and Learning Science (Southeast University), Ministry of Education, Nanjing 210096, China
2 School of Biological Science and Medical Engineering, Southeast University, China

1. Abstract

Realistic emotional voice conversion (EVC) aims to enhance the emotional diversity of converted audio, making synthesized voices more authentic and natural. To this end, we propose the Emotional Intensity-aware Network (EINet), which dynamically adjusts intonation and rhythm by incorporating controllable emotional intensity. To better capture nuances in emotional intensity, we go beyond mere distance measurements among acoustic features. Instead, an emotion evaluator is used to precisely quantify the speaker's emotional state. By employing an intensity mapper, intensity pseudo-labels are obtained to bridge the gap between emotional speech intensity modeling and run-time conversion. To ensure high speech quality while retaining controllability, an emotion renderer smoothly combines linguistic features with manipulated emotional features at the frame level. Furthermore, we employ a duration predictor to facilitate adaptive prediction of rhythm changes conditioned on a specified intensity value. Experimental results show EINet's superior performance in naturalness and diversity of emotional expression compared to state-of-the-art EVC methods.



2. Experimental Results

Original EVC Evaluation

Source | VITS-EVC | EINet (proposed) | Target | Text
Neutral-to-Angry
不管怎么说,主队好像是志在夺魁。(English: Anyway, it seems like the home team is determined to win the championship.)
我们乘船漂游了三峡,真是刺激。(English: We took a boat trip through the Three Gorges, which was truly thrilling.)
我每个月打一次电话。(English: I make a phone call once a month.)
个人收藏家,他们肯定有!(English: Private collectors, they definitely have!)
我特别喜欢网球和登山。(English: I especially like tennis and mountaineering.)
Neutral-to-Happy
不管怎么说,主队好像是志在夺魁。(English: Anyway, it seems like the home team is determined to win the championship.)
我们乘船漂游了三峡,真是刺激。(English: We took a boat trip through the Three Gorges, which was truly thrilling.)
我每个月打一次电话。(English: I make a phone call once a month.)
个人收藏家,他们肯定有!(English: Private collectors, they definitely have!)
我特别喜欢网球和登山。(English: I especially like tennis and mountaineering.)
Neutral-to-Sad
不管怎么说,主队好像是志在夺魁。(English: Anyway, it seems like the home team is determined to win the championship.)
我们乘船漂游了三峡,真是刺激。(English: We took a boat trip through the Three Gorges, which was truly thrilling.)
我每个月打一次电话。(English: I make a phone call once a month.)
个人收藏家,他们肯定有!(English: Private collectors, they definitely have!)
我特别喜欢网球和登山。(English: I especially like tennis and mountaineering.)
Neutral-to-Surprise
不管怎么说,主队好像是志在夺魁。(English: Anyway, it seems like the home team is determined to win the championship.)
我们乘船漂游了三峡,真是刺激。(English: We took a boat trip through the Three Gorges, which was truly thrilling.)
我每个月打一次电话。(English: I make a phone call once a month.)
个人收藏家,他们肯定有!(English: Private collectors, they definitely have!)
我特别喜欢网球和登山。(English: I especially like tennis and mountaineering.)

Controllable EVC Evaluation

Source | Intensity=0.2 (Weak) | Intensity=0.6 (Medium) | Intensity=0.8 (Strong) | Text
Neutral-to-Angry
打远一看,它们的确很是美丽。(English: At a distance, they indeed appear very beautiful.)
妇女节快乐,我永远爱你,妈妈。(English: Happy Women's Day, I will always love you, Mom.)
我老家在北京。(English: My hometown is in Beijing.)
Neutral-to-Happy
打远一看,它们的确很是美丽。(English: At a distance, they indeed appear very beautiful.)
妇女节快乐,我永远爱你,妈妈。(English: Happy Women's Day, I will always love you, Mom.)
我老家在北京。(English: My hometown is in Beijing.)
Neutral-to-Sad
打远一看,它们的确很是美丽。(English: At a distance, they indeed appear very beautiful.)
妇女节快乐,我永远爱你,妈妈。(English: Happy Women's Day, I will always love you, Mom.)
我老家在北京。(English: My hometown is in Beijing.)
Neutral-to-Surprise
打远一看,它们的确很是美丽。(English: At a distance, they indeed appear very beautiful.)
妇女节快乐,我永远爱你,妈妈。(English: Happy Women's Day, I will always love you, Mom.)
我老家在北京。(English: My hometown is in Beijing.)

Visualization of pitch and energy tracks.

To showcase the controllability of emotional intensity, we visualize the pitch and energy tracks of the voiced segments in test clips (neutral-to-happy conversion). As the emotional intensity increases from weak to strong, the pitch fluctuation broadens and the peak energy rises.
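
The trend described above can also be checked numerically. The following is a minimal sketch, not the authors' analysis code: it assumes that frame-level F0 (in Hz) and energy tracks have already been extracted with some external tool, and that unvoiced frames are marked with an F0 of 0. The function name `summarize_tracks` and the toy values are hypothetical and serve only as illustration.

```python
from statistics import pstdev


def summarize_tracks(f0_hz, energy):
    """Summarize pitch fluctuation and peak energy over voiced frames.

    Assumption (not from the source): unvoiced frames carry f0 == 0.
    """
    if len(f0_hz) != len(energy):
        raise ValueError("f0_hz and energy must have the same number of frames")
    # Keep only voiced frames, mirroring the 'voiced segments' in the text.
    voiced = [(f, e) for f, e in zip(f0_hz, energy) if f > 0]
    if not voiced:
        raise ValueError("no voiced frames found")
    f0_voiced = [f for f, _ in voiced]
    return {
        "pitch_range_hz": max(f0_voiced) - min(f0_voiced),
        "pitch_std_hz": pstdev(f0_voiced),
        "peak_energy": max(e for _, e in voiced),
    }


# Toy, made-up tracks standing in for a weak and a strong conversion.
weak = summarize_tracks([0, 200, 210, 205, 0], [0.0, 0.4, 0.5, 0.45, 0.0])
strong = summarize_tracks([0, 180, 260, 220, 0], [0.0, 0.5, 0.9, 0.7, 0.0])
```

Comparing `pitch_range_hz`, `pitch_std_hz`, and `peak_energy` across clips converted at different intensity values gives a rough numeric counterpart to the visual comparison.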



Visualization of converted audios at different emotional intensities.

Furthermore, we present synthesized Mel-spectrograms with F0 contours, which show that as emotional intensity increases, the acoustic variation becomes more pronounced and short pauses become more frequent. This suggests that EINet can adaptively convey intrinsic emotional states based on controllable emotional intensity, yielding favorable results in both intonation and rhythm synthesis.
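
The observation about short pauses can be approximated from a frame-level energy track. The sketch below is an illustration under stated assumptions rather than the procedure used in the paper: the function name `count_short_pauses`, the energy threshold, the 10 ms hop, and the 30 to 300 ms duration window are all hypothetical choices.

```python
def count_short_pauses(energy, threshold=0.05, hop_ms=10.0,
                       min_ms=30.0, max_ms=300.0):
    """Count low-energy runs inside speech lasting between min_ms and max_ms.

    All default values are illustrative assumptions, not values from the source.
    """
    silent = [e < threshold for e in energy]
    if False not in silent:
        return 0  # the clip contains no speech frames at all
    # Trim leading and trailing silence so only pauses inside speech count.
    first = silent.index(False)
    last = len(silent) - 1 - silent[::-1].index(False)
    count = 0
    run = 0
    for is_silent in silent[first:last + 1]:
        if is_silent:
            run += 1
        else:
            # A speech frame closes the current silent run, if any.
            if run and min_ms <= run * hop_ms <= max_ms:
                count += 1
            run = 0
    return count


# Toy, made-up energy track: one 30 ms pause and one 10 ms dip inside speech.
toy_energy = [0.0, 0.0, 0.5, 0.6, 0.0, 0.0, 0.0, 0.7, 0.0, 0.8, 0.0, 0.0]
pauses = count_short_pauses(toy_energy)
```

Running the same count on clips converted at increasing intensity values would indicate whether pauses become more frequent, although the result depends on the chosen threshold and duration window.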