Wing-Kwong Chan

CV
h-index7
8papers
285citations
Novelty54%
AI Score31

8 Papers

CVJun 27, 2023Code
GroundNLQ @ Ego4D Natural Language Queries Challenge 2023

Zhijian Hou, Lei Ji, Difei Gao et al.

In this report, we present our champion solution for Ego4D Natural Language Queries (NLQ) Challenge in CVPR 2023. Essentially, to accurately ground in a video, an effective egocentric feature extractor and a powerful grounding model are required. Motivated by this, we leverage a two-stage pre-training strategy to train egocentric feature extractors and the grounding model on video narrations, and further fine-tune the model on annotated data. In addition, we introduce a novel grounding model GroundNLQ, which employs a multi-modal multi-scale grounding module for effective video and text fusion and various temporal intervals, especially for long videos. On the blind test set, GroundNLQ achieves 25.67 and 18.18 for R1@IoU=0.3 and R1@IoU=0.5, respectively, and surpasses all other teams by a noticeable margin. Our code will be released at\url{https://github.com/houzhijian/GroundNLQ}.

CVSep 22, 2022Code
CONE: An Efficient COarse-to-fiNE Alignment Framework for Long Video Temporal Grounding

Zhijian Hou, Wanjun Zhong, Lei Ji et al.

This paper tackles an emerging and challenging problem of long video temporal grounding~(VTG) that localizes video moments related to a natural language (NL) query. Compared with short videos, long videos are also highly demanded but less explored, which brings new challenges in higher inference computation cost and weaker multi-modal alignment. To address these challenges, we propose CONE, an efficient COarse-to-fiNE alignment framework. CONE is a plug-and-play framework on top of existing VTG models to handle long videos through a sliding window mechanism. Specifically, CONE (1) introduces a query-guided window selection strategy to speed up inference, and (2) proposes a coarse-to-fine mechanism via a novel incorporation of contrastive learning to enhance multi-modal alignment for long videos. Extensive experiments on two large-scale long VTG benchmarks consistently show both substantial performance gains (e.g., from 3.13% to 6.87% on MAD) and state-of-the-art results. Analyses also reveal higher efficiency as the query-guided window selection mechanism accelerates inference time by 2x on Ego4D-NLQ and 15x on MAD while keeping SOTA results. Codes have been released at https://github.com/houzhijian/CONE.

CVJun 28, 2023
Incremental Learning on Food Instance Segmentation

Huu-Thanh Nguyen, Yu Cao, Chong-Wah Ngo et al.

Food instance segmentation is essential to estimate the serving size of dishes in a food image. The recent cutting-edge techniques for instance segmentation are deep learning networks with impressive segmentation quality and fast computation. Nonetheless, they are hungry for data and expensive for annotation. This paper proposes an incremental learning framework to optimize the model performance given a limited data labelling budget. The power of the framework is a novel difficulty assessment model, which forecasts how challenging an unlabelled sample is to the latest trained instance segmentation model. The data collection procedure is divided into several stages, each in which a new sample package is collected. The framework allocates the labelling budget to the most difficult samples. The unlabelled samples that meet a certain qualification from the assessment model are used to generate pseudo-labels. Eventually, the manual labels and pseudo-labels are sent to the training data to improve the instance segmentation model. On four large-scale food datasets, our proposed framework outperforms current incremental learning benchmarks and achieves competitive performance with the model trained on fully annotated samples.

CVMay 8, 2022
Cross-lingual Adaptation for Recipe Retrieval with Mixup

Bin Zhu, Chong-Wah Ngo, Jingjing Chen et al.

Cross-modal recipe retrieval has attracted research attention in recent years, thanks to the availability of large-scale paired data for training. Nevertheless, obtaining adequate recipe-image pairs covering the majority of cuisines for supervised learning is difficult if not impossible. By transferring knowledge learnt from a data-rich cuisine to a data-scarce cuisine, domain adaptation sheds light on this practical problem. Nevertheless, existing works assume recipes in source and target domains are mostly originated from the same cuisine and written in the same language. This paper studies unsupervised domain adaptation for image-to-recipe retrieval, where recipes in source and target domains are in different languages. Moreover, only recipes are available for training in the target domain. A novel recipe mixup method is proposed to learn transferable embedding features between the two domains. Specifically, recipe mixup produces mixed recipes to form an intermediate domain by discretely exchanging the section(s) between source and target recipes. To bridge the domain gap, recipe mixup loss is proposed to enforce the intermediate domain to locate in the shortest geodesic path between source and target domains in the recipe embedding space. By using Recipe 1M dataset as source domain (English) and Vireo-FoodTransfer dataset as target domain (Chinese), empirical experiments verify the effectiveness of recipe mixup for cross-lingual adaptation in the context of image-to-recipe retrieval.

CVNov 16, 2022
An Efficient COarse-to-fiNE Alignment Framework @ Ego4D Natural Language Queries Challenge 2022

Zhijian Hou, Wanjun Zhong, Lei Ji et al.

This technical report describes the CONE approach for Ego4D Natural Language Queries (NLQ) Challenge in ECCV 2022. We leverage our model CONE, an efficient window-centric COarse-to-fiNE alignment framework. Specifically, CONE dynamically slices the long video into candidate windows via a sliding window approach. Centering at windows, CONE (1) learns the inter-window (coarse-grained) semantic variance through contrastive learning and speeds up inference by pre-filtering the candidate windows relevant to the NL query, and (2) conducts intra-window (fine-grained) candidate moments ranking utilizing the powerful multi-modal alignment ability of the contrastive vision-text pre-trained model EgoVLP. On the blind test set, CONE achieves 15.26 and 9.24 for R1@IoU=0.3 and R1@IoU=0.5, respectively.

CVJul 1, 2022
(Un)likelihood Training for Interpretable Embedding

Jiaxin Wu, Chong-Wah Ngo, Wing-Kwong Chan et al.

Cross-modal representation learning has become a new normal for bridging the semantic gap between text and visual data. Learning modality agnostic representations in a continuous latent space, however, is often treated as a black-box data-driven training process. It is well-known that the effectiveness of representation learning depends heavily on the quality and scale of training data. For video representation learning, having a complete set of labels that annotate the full spectrum of video content for training is highly difficult if not impossible. These issues, black-box training and dataset bias, make representation learning practically challenging to be deployed for video understanding due to unexplainable and unpredictable results. In this paper, we propose two novel training objectives, likelihood and unlikelihood functions, to unroll semantics behind embeddings while addressing the label sparsity problem in training. The likelihood training aims to interpret semantics of embeddings beyond training labels, while the unlikelihood training leverages prior knowledge for regularization to ensure semantically coherent interpretation. With both training objectives, a new encoder-decoder network, which learns interpretable cross-modal representation, is proposed for ad-hoc video search. Extensive experiments on TRECVid and MSR-VTT datasets show the proposed network outperforms several state-of-the-art retrieval models with a statistically significant performance margin.

CVApr 9, 2024
Improving Interpretable Embeddings for Ad-hoc Video Search with Generative Captions and Multi-word Concept Bank

Jiaxin Wu, Chong-Wah Ngo, Wing-Kwong Chan

Aligning a user query and video clips in cross-modal latent space and that with semantic concepts are two mainstream approaches for ad-hoc video search (AVS). However, the effectiveness of existing approaches is bottlenecked by the small sizes of available video-text datasets and the low quality of concept banks, which results in the failures of unseen queries and the out-of-vocabulary problem. This paper addresses these two problems by constructing a new dataset and developing a multi-word concept bank. Specifically, capitalizing on a generative model, we construct a new dataset consisting of 7 million generated text and video pairs for pre-training. To tackle the out-of-vocabulary problem, we develop a multi-word concept bank based on syntax analysis to enhance the capability of a state-of-the-art interpretable AVS method in modeling relationships between query words. We also study the impact of current advanced features on the method. Experimental results show that the integration of the above-proposed elements doubles the R@1 performance of the AVS method on the MSRVTT dataset and improves the xinfAP on the TRECVid AVS query sets for 2016-2023 (eight years) by a margin from 2% to 77%, with an average about 20%.

SEAug 11, 2020
RegionTrack: A Trace-based Sound and Complete Checker to Debug Transactional Atomicity Violations and Non-Serializable Traces

Xiaoxue Ma, Shangru Wu, Ernest Pobee et al.

Atomicity is a correctness criterion to reason about isolated code regions in a multithreaded program when they are executed concurrently. However, dynamic instances of these code regions, called transactions, may fail to behave atomically, resulting in transactional atomicity violations. Existing dynamic online atomicity checkers incur either false positives or false negatives in detecting transactions experiencing transactional atomicity violations. This paper proposes RegionTrack. RegionTrack tracks cross-thread dependences at the event, dynamic subregion, and transaction levels. It maintains both dynamic subregions within selected transactions and transactional happens-before relations through its novel timestamp propagation approach. We prove that RegionTrack is sound and complete in detecting both transactional atomicity violations and non-serializable traces. To the best of our knowledge, it is the first online technique that precisely captures the transitively closed set of happens-before relations over all conflicting events with respect to every running transaction for the above two kinds of issues. We have evaluated RegionTrack on 19 subjects of the DaCapo and the Java Grande Forum benchmarks. The empirical results confirm that RegionTrack precisely detected all those transactions which experienced transactional atomicity violations and identified all non-serializable traces. The overall results also show that RegionTrack incurred 1.10x and 1.08x lower memory and runtime overheads than Velodrome and 2.10x and 1.21x lower than Aerodrome, respectively. Moreover, it incurred 2.89x lower memory overhead than DoubleChecker. On average, Velodrome detected about 55% fewer violations than RegionTrack, which in turn reported about 3%-70% fewer violations than DoubleChecker.