\begin{abstract}
In portrait video generation, generating videos from a single image has become increasingly prevalent.
A common approach is to augment generative models with adapters for controlled generation.
However, control signals (\eg, text, audio, reference image, pose, and depth map) vary in strength.
Weaker conditions are often rendered ineffective by interference from stronger ones, which makes balancing them a challenge.
In our work on portrait video generation, we find audio to be a particularly weak signal, often overshadowed by stronger signals such as facial pose and the reference image.
Moreover, training directly with weak signals often makes convergence difficult.
To address this, we propose V-Express, a simple method that balances different control signals through progressive training and conditional dropout.
Our method gradually enables effective control by weak conditions, yielding generation that simultaneously accounts for facial pose, reference image, and audio.
Experimental results demonstrate that our method effectively generates portrait videos controlled by audio.
More broadly, it offers a potential solution for the simultaneous and effective use of conditions of varying strengths.
\end{abstract}
