\begin{abstract}
Generating portrait videos from a single image has become increasingly prevalent.
A common approach leverages generative models, enhanced with adapters, for controlled generation.
However, control signals (\eg, text, audio, reference image, pose, and depth map) can vary in strength.
Weaker conditions often fail to take effect because of interference from stronger ones, which makes balancing the conditions challenging.
In our work on portrait video generation, we identify audio as a particularly weak signal, often overshadowed by stronger signals such as the facial pose and the reference image.
Training directly with weak signals, however, often leads to convergence difficulties.
To address this, we propose V-Express, a simple method that balances different control signals through progressive training and conditional dropout.
Our method gradually enables effective control by weak conditions, so that generation simultaneously accounts for the facial pose, the reference image, and the audio.
Experimental results demonstrate that our method effectively generates portrait videos controlled by audio.
Furthermore, our method offers a potential solution for the simultaneous and effective use of conditions of varying strengths.
\end{abstract}
