\begin{abstract}

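% First sentence: the broad background of large language models in edge computing
% Second sentence: serving LLMs requires solving the CAP problem
% Third sentence: solving the CAP problem is difficult
% Fourth sentence: we currently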

Large Language Models (LLMs) can perform zero-shot learning on unseen tasks and few-shot learning on complex reasoning tasks. However, resource-limited mobile edge networks struggle to support long-context LLM serving for LLM agents during multi-round interactions with users. Unlike stateless computation offloading and static service offloading in edge computing, LLM serving at edge servers is difficult to optimize because LLMs continuously learn from context, so their accuracy, latency, and resource consumption change dynamically.

In this paper, we propose a joint model caching and inference offloading framework that uses test-time deep reinforcement learning (T2DRL) to optimize deployment and execution strategies for long-context LLM serving. Within this framework, we analyze performance convergence and formulate an optimization problem that accounts for the utilization of context windows in LLMs. The T2DRL algorithm learns during both the training and testing phases, which lets it proactively manage cached models and service requests and adapt to changes in context and usage patterns during execution. To further improve resource allocation efficiency, we propose a double Dutch auction (DDA) mechanism that dynamically matches supply and demand while maximizing social welfare. Experimental results demonstrate that the T2DRL algorithm reduces system costs by at least 30\% compared with baselines while guaranteeing the performance of LLM agents in real-world perception and reasoning tasks.

% Previous frameworks for computation and service offloading are inadequate for optimizing LLM services as they fail to capture the dynamic learning capabilities of LLMs and the need for service adaptation based on evolving model contexts.

\end{abstract}
