
\begin{abstract}
Instruction tuning in multimodal large language models (MLLMs) aims to smoothly integrate a backbone LLM with a pre-trained feature encoder for downstream tasks.
The major challenge is to efficiently find the synergy through cooperative learning, in which the LLM adapts its reasoning abilities to downstream tasks while the feature encoder adjusts its encoding to provide more relevant modal information.
In this paper, we analyze MLLM instruction tuning from both theoretical and empirical perspectives, and find that unbalanced learning between the two components, i.e., the feature encoder and the LLM, can cause diminishing learning gradients that slow model convergence and often lead to sub-optimal results due to insufficient learning.
Inspired by these findings, we propose a measure to quantitatively evaluate the learning balance, based on which we further design a dynamic learning scheduler that better coordinates the learning.
In addition, we introduce an auxiliary loss regularization method that promotes updating of the generation distribution of MLLMs while considering the learning state of each model component, which can help prevent gradient diminishing in each component and enables a more accurate estimation of the learning balance coefficient.
Our techniques are model-agnostic and can be generically integrated with various MLLM backbones; we conduct experiments with multiple LLM backbones and feature encoders.
Experimental results on multiple downstream tasks across the vision and audio modalities demonstrate the proposed method's improved efficiency and effectiveness in MLLM instruction tuning.
\end{abstract}