CV · Jun 12, 2023
LUT-GCE: Lookup Table Global Curve Estimation for Fast Low-light Image Enhancement
Changguang Wu, Jiangxin Dong, Jinhui Tang
We present an effective and efficient approach for low-light image enhancement, named Lookup Table Global Curve Estimation (LUT-GCE). In contrast to existing curve-based methods with pixel-wise adjustment, we propose to estimate a global curve for the entire image that allows corrections for both under- and over-exposure. Specifically, we develop a novel cubic curve formulation for light enhancement, which enables an image-adaptive and pixel-independent curve for adjusting the range of an image. We then propose a global curve estimation network (GCENet), a very light network with only 25.4k parameters. To further speed up inference, a lookup table is employed for fast retrieval. In addition, a novel histogram smoothness loss is designed to enable zero-shot learning; it improves the contrast of the image and recovers clearer details. Quantitative and qualitative results demonstrate the effectiveness of the proposed approach. Furthermore, our approach outperforms the state of the art in terms of inference speed, especially on high-definition images (e.g., 1080p and 4K).
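The pairing of a global curve with a lookup table can be illustrated with a minimal sketch. The abstract does not give the cubic formulation, so `cubic_curve` and its coefficients `a` and `b` below are hypothetical stand-ins chosen only so that the curve maps 0 to 0 and 1 to 1; they are not the paper's curve, and in LUT-GCE the curve parameters would come from GCENet rather than being set by hand. The point shown is that a pixel-independent curve needs to be evaluated only once per intensity level, after which enhancement is a table lookup regardless of image resolution.

```python
def cubic_curve(x, a, b):
    """Hypothetical cubic tone curve on [0, 1] with f(0) = 0 and f(1) = 1.

    This is an illustrative assumption, not the formulation from the paper.
    """
    return x + a * x * (1.0 - x) + b * x * (1.0 - x) * (1.0 - 2.0 * x)


def build_lut(a, b, levels=256):
    """Evaluate the curve once per intensity level and store the results."""
    lut = []
    for v in range(levels):
        x = v / (levels - 1)
        y = min(1.0, max(0.0, cubic_curve(x, a, b)))  # clamp to valid range
        lut.append(int(round(y * (levels - 1))))
    return lut


def apply_lut(image, lut):
    """Enhance an image (list of rows of integer intensities) by table lookup."""
    return [[lut[p] for p in row] for row in image]


if __name__ == "__main__":
    lut = build_lut(a=0.8, b=0.0)          # a > 0 brightens mid-tones
    dark = [[0, 16, 32], [64, 128, 255]]
    print(apply_lut(dark, lut))
```

With this design the cost of the curve evaluation is fixed at 256 evaluations per channel, which is consistent with the abstract's claim that the speed advantage grows on 1080p and 4K inputs.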
CV · Sep 15, 2023
Dynamic Visual Semantic Sub-Embeddings and Fast Re-Ranking
Wenzhang Wei, Zhipeng Gui, Changguang Wu et al.
The core of cross-modal matching is to accurately measure the similarity between different modalities in a unified representation space. However, compared with a textual description, which reflects a particular perspective, the visual modality has more semantic variations. Images are therefore usually associated with multiple textual captions in databases. Although popular symmetric embedding methods have explored numerous modal interaction approaches, they often learn toward increasing the average expression probability of multiple semantic variations within image embeddings. Consequently, information entropy in embeddings is increased, resulting in redundancy and decreased accuracy. In this work, we propose a Dynamic Visual Semantic Sub-Embeddings framework (DVSE) to reduce the information entropy. Specifically, we obtain a set of heterogeneous visual sub-embeddings through a dynamic orthogonal constraint loss. To encourage the generated candidate embeddings to capture various semantic variations, we construct a mixed distribution and employ a variance-aware weighting loss to assign different weights to the optimization process. In addition, we develop a Fast Re-ranking strategy (FR) to efficiently evaluate the retrieval results and enhance the performance. We compare the performance with existing set-based methods using four image feature encoders and two text feature encoders on three benchmark datasets: MSCOCO, Flickr30K, and CUB Captions. We also show the role of different components through ablation studies and perform a sensitivity analysis of the hyperparameters. The qualitative analysis of visualized bidirectional retrieval and attention maps further demonstrates the ability of our method to encode semantic variations.
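The idea of pushing sub-embeddings apart with an orthogonality constraint can be sketched as follows. The abstract does not define the dynamic orthogonal constraint loss, so the function below is a generic, static penalty (mean squared cosine similarity over all pairs of sub-embeddings) offered as an assumption; the name `orthogonality_penalty` is hypothetical, and the paper's dynamic variant may differ substantially.

```python
import math


def orthogonality_penalty(sub_embeddings):
    """Mean squared cosine similarity between all pairs of sub-embeddings.

    Returns 0.0 when every pair is orthogonal and 1.0 when all vectors are
    collinear. A generic illustration, not the loss defined in the paper.
    """
    normed = []
    for vec in sub_embeddings:
        norm = math.sqrt(sum(v * v for v in vec))
        if norm == 0.0:
            raise ValueError("sub-embeddings must be non-zero")
        normed.append([v / norm for v in vec])

    k = len(normed)
    if k < 2:
        return 0.0  # no pairs to compare

    total = 0.0
    for i in range(k):
        for j in range(i + 1, k):
            cos = sum(a * b for a, b in zip(normed[i], normed[j]))
            total += cos * cos
    return total / (k * (k - 1) / 2)


if __name__ == "__main__":
    print(orthogonality_penalty([[1.0, 0.0], [0.0, 1.0]]))  # orthogonal pair
    print(orthogonality_penalty([[1.0, 1.0], [2.0, 2.0]]))  # collinear pair
```

Minimizing a penalty of this kind discourages the sub-embeddings from encoding the same semantic direction, which is the redundancy the abstract attributes to averaged image embeddings.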
47.8 · IR · May 5
Frozen LVLMs for Micro-Video Recommendation: A Systematic Study of Feature Extraction and Fusion
Huatuan Sun, Yunshan Ma, Changguang Wu et al.
Frozen Large Video Language Models (LVLMs) are increasingly employed in micro-video recommendation due to their strong multimodal understanding. However, their integration has not been evaluated systematically: practitioners typically deploy LVLMs as fixed black-box feature extractors without comparing alternative representation strategies. To address this gap, we present the first systematic empirical study along two key design dimensions: (i) integration strategies with ID embeddings, specifically replacement versus fusion, and (ii) feature extraction paradigms, comparing LVLM-generated captions with intermediate decoder hidden states. Extensive experiments on representative LVLMs reveal three key principles: (1) intermediate hidden states consistently outperform caption-based representations, as natural-language summarization inevitably discards fine-grained visual semantics crucial for recommendation; (2) ID embeddings capture irreplaceable collaborative signals, rendering fusion strictly superior to replacement; and (3) the effectiveness of intermediate decoder features varies significantly across layers. Guided by these insights, we propose the Dual Feature Fusion (DFF) Framework, a lightweight and plug-and-play approach that adaptively fuses multi-layer representations from frozen LVLMs with item ID embeddings. DFF achieves state-of-the-art performance on two real-world micro-video recommendation benchmarks, consistently outperforming strong baselines and providing a principled approach to integrating off-the-shelf large vision-language models into micro-video recommender systems.
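Fusing multi-layer features with an ID embedding can be sketched in a few lines. The abstract does not describe DFF's architecture, so everything below is an assumption for illustration: `fuse_layers` combines per-layer hidden-state vectors with softmax weights over hypothetical learnable scores, and `fuse_with_id` blends the result with an item ID embedding through a hypothetical scalar gate. A real system would learn the scores and gate during training and would first project the LVLM features to the ID embedding dimension.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_layers(layer_features, layer_scores):
    """Weighted sum of per-layer feature vectors (all of equal length).

    layer_scores are hypothetical learnable scalars, one per layer.
    """
    weights = softmax(layer_scores)
    dim = len(layer_features[0])
    return [
        sum(w * feat[d] for w, feat in zip(weights, layer_features))
        for d in range(dim)
    ]


def fuse_with_id(content_vec, id_vec, gate):
    """Blend content features with an ID embedding; gate lies in [0, 1]."""
    return [gate * c + (1.0 - gate) * i for c, i in zip(content_vec, id_vec)]


if __name__ == "__main__":
    layers = [[1.0, 0.0], [0.0, 1.0]]       # two layers, toy 2-d features
    content = fuse_layers(layers, [0.0, 0.0])
    item = fuse_with_id(content, [1.0, 1.0], gate=0.5)
    print(content, item)
```

Setting `gate` to 1.0 in this sketch corresponds to the "replacement" strategy the abstract compares against, since the ID embedding is then ignored entirely; intermediate values correspond to fusion.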