\begin{abstract}
Instruction tuning of multimodal large language models (MLLMs) aims to smoothly integrate a backbone LLM with a pre-trained feature encoder for downstream tasks.
The major challenge is how to efficiently find the synergy between the two through cooperative learning, in which the LLM adapts its reasoning abilities to downstream tasks
while the feature encoder adjusts its encoding to provide more relevant modality-specific information.
In this paper, we analyze MLLM instruction tuning from both theoretical and empirical perspectives,
and find that unbalanced learning between the two components, i.e., the feature encoder and the LLM,
can cause diminishing learning gradients that slow model convergence and often lead to sub-optimal results due to insufficient learning.
Inspired by our findings, we propose a measurement to quantitatively evaluate the learning balance, based on which we further design a dynamic learning scheduler that better coordinates the learning of the two components.
In addition, we introduce an auxiliary loss regularization method that promotes updating of the MLLM's generation distribution according to the learning state of each model component,
which potentially prevents each component from suffering diminishing gradients and enables a more accurate estimation of the learning balance coefficient.
We conduct experiments with multiple LLM backbones and feature encoders; our techniques are model-agnostic and can be generically integrated with various MLLM backbones.
Experimental results on multiple downstream tasks and modalities in vision and audio
demonstrate the proposed method's better efficiency and effectiveness in MLLM instruction tuning.
\end{abstract}