\begin{abstract}
In portrait video generation, synthesizing a portrait video from a single image has become increasingly prevalent.
A common approach is to augment generative models with adapters for controlled generation.
However, control signals (\eg, text, audio, reference image, pose, and depth map) vary in strength.
Weaker conditions often fail to take effect because of interference from stronger ones, which makes balancing the conditions a challenge.
In our work on portrait video generation, we find that audio is a particularly weak signal, often overshadowed by stronger signals such as facial pose and the reference image.
Training directly with weak signals, however, often leads to difficulties in convergence.
To address this, we propose V-Express, a simple method that balances different control signals through progressive training and a conditional dropout operation.
Our method gradually enables effective control by weak conditions, yielding generation that simultaneously takes into account facial pose, reference image, and audio.
Experimental results demonstrate that our method can effectively generate portrait videos controlled by audio.
Furthermore, it offers a potential solution for the simultaneous and effective use of conditions of varying strengths.
\end{abstract}