Lingzhi Zhao

MM
h-index6
4papers
24citations
Novelty51%
AI Score42

4 Papers

55.8MMApr 9
QoS-QoE Translation with Large Language Model

Yingjie Yu, Mingyuan Wu, Ahmadreza Eslaminia et al.

QoS-QoE translation is a fundamental problem in multimedia systems because it characterizes how measurable system and network conditions affect user-perceived experience. Although many prior studies have examined this relationship, their findings are often developed for specific setups and remain scattered across papers, experimental settings, and reporting formats, limiting systematic reuse, cross-scenario generalization, and large-scale analysis. To address this gap, we first introduce QoS-QoE Translation dataset, a source-grounded dataset of structured QoS-QoE relationships from the multimedia literature, with a focus on video streaming related tasks. We construct the dataset through an automated pipeline that combines paper curation, QoS-QoE relationship extraction, and iterative data evaluation. Each record preserves the extracted relationship together with parameter definitions, supporting evidence, and contextual metadata. We further evaluate the capability of large language models (LLMs) on QoS-QoE translation, both before and after supervised fine-tuning on our dataset, and show strong performance on both continuous-value and discrete-label prediction in bidirectional translation, from QoS-QoE and QoE-QoS. Our dataset provides a foundation for benchmarking LLMs in QoS-QoE translation and for supporting future LLM-based reasoning for multimedia quality prediction and optimization. The complete dataset and code are publicly available at https://yyu6969.github.io/qos-qoe-translation-page/, for full reproducibility and open access.

HCSep 17, 2025
AquaVLM: Improving Underwater Situation Awareness with Mobile Vision Language Models

Beitong Tian, Lingzhi Zhao, Bo Chen et al.

Underwater activities like scuba diving enable millions annually to explore marine environments for recreation and scientific research. Maintaining situational awareness and effective communication are essential for diver safety. Traditional underwater communication systems are often bulky and expensive, limiting their accessibility to divers of all levels. While recent systems leverage lightweight smartphones and support text messaging, the messages are predefined and thus restrict context-specific communication. In this paper, we present AquaVLM, a tap-and-send underwater communication system that automatically generates context-aware messages and transmits them using ubiquitous smartphones. Our system features a mobile vision-language model (VLM) fine-tuned on an auto-generated underwater conversation dataset and employs a hierarchical message generation pipeline. We co-design the VLM and transmission, incorporating error-resilient fine-tuning to improve the system's robustness to transmission errors. We develop a VR simulator to enable users to experience AquaVLM in a realistic underwater environment and create a fully functional prototype on the iOS platform for real-world experiments. Both subjective and objective evaluations validate the effectiveness of AquaVLM and highlight its potential for personal underwater communication as well as broader mobile VLM applications.

CVJan 27, 2024
STAC: Leveraging Spatio-Temporal Data Associations For Efficient Cross-Camera Streaming and Analytics

Ragini Gupta, Lingzhi Zhao, Jiaxi Li et al.

In IoT based distributed network of cameras, real-time multi-camera video analytics is challenged by high bandwidth demands and redundant visual data, creating a fundamental tension where reducing data saves network overhead but can degrade model performance, and vice versa. We present STAC, a cross-cameras surveillance system that leverages spatio-temporal associations for efficient object tracking under constrained network conditions. STAC integrates multi-resolution feature learning, ensuring robustness under variable networked system level optimizations such as frame filtering, FFmpeg-based compression, and Region-of-Interest (RoI) masking, to eliminate redundant content across distributed video streams while preserving downstream model accuracy for object identification and tracking. Evaluated on NVIDIA's AICity Challenge dataset, STAC achieves a 76\% improvement in tracking accuracy and an 8.6x reduction in inference latency over a standard multi-object multi-camera tracking baseline (using YOLOv4 and DeepSORT). Furthermore, 29\% of redundant frames are filtered, significantly reducing data volume without compromising inference quality.

MMJul 20, 2021
Adaptive Streaming of 360 Videos with Perfect, Imperfect, and Unknown FoV Viewing Probabilities in Wireless Networks

Lingzhi Zhao, Ying Cui, Zhi Liu et al.

This paper investigates adaptive streaming of one or multiple tiled 360 videos from a multi-antenna base station (BS) to one or multiple single-antenna users, respectively, in a multi-carrier wireless system. We aim to maximize the video quality while keeping rebuffering time small via encoding rate adaptation at each group of pictures (GOP) and transmission adaptation at each (transmission) slot. To capture the impact of field-of-view (FoV) prediction, we consider three cases of FoV viewing probability distributions, i.e., perfect, imperfect, and unknown FoV viewing probability distributions, and use the average total utility, worst average total utility, and worst total utility as the respective performance metrics. In the single-user scenario, we optimize the encoding rates of the tiles, encoding rates of the FoVs, and transmission beamforming vectors for all subcarriers to maximize the total utility in each case. In the multi-user scenario, we adopt rate splitting with successive decoding and optimize the encoding rates of the tiles, encoding rates of the FoVs, rates of the common and private messages, and transmission beamforming vectors for all subcarriers to maximize the total utility in each case. Then, we separate the challenging optimization problem into multiple tractable problems in each scenario. In the single-user scenario, we obtain a globally optimal solution of each problem using transformation techniques and the Karush-Kuhn-Tucker (KKT) conditions. In the multi-user scenario, we obtain a KKT point of each problem using the concave-convex procedure (CCCP). Finally, numerical results demonstrate that the proposed solutions achieve notable gains over existing schemes in all three cases. To the best of our knowledge, this is the first work revealing the impact of FoV prediction on the performance of adaptive streaming of tiled 360 videos.