NI LG · May 6, 2025

Task-Oriented Multimodal Token Transmission in Resource-Constrained Multiuser Networks

Junhe Zhang, Wanli Ni, Pengwei Wang, Dongyu Wang

arXiv:2505.07841v32.31 citations · h-index: 10 · Has Code · IEEE Wireless Communications Letters

Originality: Incremental advance

AI Analysis

This work addresses the communication cost of transmitting token embeddings from large model-based agents in resource-constrained multiuser networks. It is an incremental advance whose gains come from token compression and joint resource allocation.

The paper tackles the high bandwidth overhead and latency caused by long token embeddings in transformer-based agents. It proposes a task-oriented multimodal token transmission scheme that combines token compression with joint optimization of bandwidth, power allocation, and token length. Simulation results show that the algorithm outperforms the baseline under different bandwidth and power budgets, and that training with cross-modal alignment achieves higher accuracy across signal-to-noise ratios than training without it.

With the emergence of large model-based agents, widely adopted transformer-based architectures inevitably produce excessively long token embeddings for transmission, which may result in high bandwidth overhead, increased power consumption, and latency. In this letter, we propose a task-oriented multimodal token transmission scheme for efficient multimodal information fusion and utilization. To improve the efficiency of token transmission, we design a two-stage training algorithm, comprising cross-modal alignment and task-oriented fine-tuning, for large model-based token communication. Meanwhile, token compression is performed using a sliding window pooling operation to save communication resources. To balance the trade-off between latency and model performance caused by compression, we formulate a weighted-sum optimization problem over latency and validation loss. We jointly optimize bandwidth, power allocation, and token length across users using an alternating optimization method. Simulation results demonstrate that the proposed algorithm outperforms the baseline under different bandwidth and power budgets. Moreover, the two-stage training algorithm achieves higher accuracy across various signal-to-noise ratios than the method without cross-modal alignment.
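
As an illustration of the token compression step mentioned in the abstract, the sketch below pools a sequence of token embeddings with a sliding window along the sequence axis. The abstract does not say which pooling operator, window size, or stride the authors use, so average pooling, the `window` and `stride` parameters, and the function name `sliding_window_pool` are all assumptions made for this example, not the paper's implementation.

```python
from typing import List

Token = List[float]  # one token embedding


def sliding_window_pool(tokens: List[Token], window: int, stride: int) -> List[Token]:
    """Average-pool token embeddings over a sliding window (assumed operator).

    Each output token is the element-wise mean of `window` consecutive input
    tokens; the window advances by `stride` tokens per step.
    """
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    if len(tokens) < window:
        raise ValueError("sequence is shorter than the pooling window")

    pooled: List[Token] = []
    for start in range(0, len(tokens) - window + 1, stride):
        chunk = tokens[start:start + window]
        dim = len(chunk[0])
        # Element-wise mean across the tokens in the current window.
        pooled.append([sum(tok[d] for tok in chunk) / window for d in range(dim)])
    return pooled


# Toy input: 8 tokens of dimension 2 (hypothetical values).
tokens = [[float(i), float(2 * i)] for i in range(8)]
compressed = sliding_window_pool(tokens, window=4, stride=2)
print(len(tokens), "->", len(compressed))
```

In this toy setting, a longer stride yields fewer tokens to transmit, at the cost of coarser embeddings. That is the kind of latency versus performance trade-off the weighted-sum problem in the abstract is formulated to balance.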

