\begin{abstract}
Instruction tuning in multimodal large language models (MLLMs) aims to smoothly integrate a backbone LLM with a pre-trained feature encoder for downstream tasks.
The major challenge is how to efficiently find the synergy through cooperative learning, where LLMs adapt their reasoning abilities to downstream tasks while feature encoders adjust their encoding to provide more relevant modal information.
In this paper, we analyze MLLM instruction tuning from both theoretical and empirical perspectives and find that unbalanced learning between the two components, i.e., the feature encoder and the LLM, can cause diminishing learning gradients that slow model convergence and often lead to sub-optimal results due to insufficient learning.
Inspired by our findings, we propose a measurement to quantitatively evaluate the learning balance, based on which we further design a dynamic learning scheduler that better coordinates the learning.
In addition, we introduce an auxiliary loss regularization method that promotes updating of the generation distribution of MLLMs considering the learning state of each model component, which potentially prevents each component from suffering diminishing gradients and enables a more accurate estimation of the learning balance coefficient.
We conduct experiments with multiple LLM backbones and feature encoders; our techniques are model-agnostic and can be generically integrated with various MLLM backbones.
Experimental results on multiple downstream tasks and modalities in vision and audio demonstrate the proposed method's better efficiency and effectiveness in MLLM instruction tuning.
\end{abstract}