Liang Gou

CV
h-index: 15
19 papers
398 citations
Novelty: 52%
AI Score: 42

19 Papers

CV | Jun 25, 2023 | Code
Hyp-OW: Exploiting Hierarchical Structure Learning with Hyperbolic Distance Enhances Open World Object Detection

Thang Doan, Xin Li, Sima Behpour et al.

Open World Object Detection (OWOD) is a challenging and realistic task that extends beyond the scope of the standard object detection task. It involves detecting both known and unknown objects while integrating learned knowledge for future tasks. However, the level of "unknownness" varies significantly depending on the context. For example, a tree is typically considered part of the background in a self-driving scene, but it may be significant in a household context. We argue that this contextual information should already be embedded within the known classes. In other words, there should be a semantic or latent structural relationship between the known and unknown items to be discovered. Motivated by this observation, we propose Hyp-OW, a method that learns and models a hierarchical representation of known items through a SuperClass Regularizer. Leveraging this representation allows us to effectively detect unknown objects using a similarity distance-based relabeling module. Extensive experiments on benchmark datasets demonstrate the effectiveness of Hyp-OW, achieving improvements in both known and unknown detection (up to 6 percent). These findings are particularly pronounced in our newly designed benchmark, where a strong hierarchical structure exists between known and unknown objects. Our code can be found at https://github.com/boschresearch/Hyp-OW

CV | Aug 1, 2023
GradOrth: A Simple yet Efficient Out-of-Distribution Detection with Orthogonal Projection of Gradients

Sima Behpour, Thang Doan, Xin Li et al.

Detecting out-of-distribution (OOD) data is crucial for ensuring the safe deployment of machine learning models in real-world applications. However, existing OOD detection approaches primarily rely on the feature maps or the full gradient space information to derive OOD scores, neglecting the role of the most important parameters of the pre-trained network over in-distribution (ID) data. In this study, we propose a novel approach called GradOrth to facilitate OOD detection based on one intriguing observation: the important features for identifying OOD data lie in the lower-rank subspace of in-distribution (ID) data. In particular, we identify OOD data by computing the norm of the gradient projection on the subspaces considered important for the in-distribution data. A large orthogonal projection value (i.e., a small projection value) marks the sample as OOD, as it captures a weak correlation with the ID data. This simple yet effective method exhibits outstanding performance, showcasing a notable reduction in the average false positive rate at a 95% true positive rate (FPR95) of up to 8% when compared to the current state-of-the-art methods.
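The projection idea in this abstract can be illustrated with a minimal sketch. It is not the GradOrth implementation: the function names, the use of plain Python lists, and the assumption that the ID subspace is already given as a set of orthonormal basis vectors are all illustrative choices, and how the paper obtains that subspace is not covered here.

```python
import math

def project_onto_subspace(vector, basis):
    """Project `vector` onto the span of `basis`.

    Assumption: `basis` is a list of orthonormal vectors that stand in for
    the subspace considered important for in-distribution (ID) data.
    """
    projection = [0.0] * len(vector)
    for direction in basis:
        coeff = sum(v * d for v, d in zip(vector, direction))
        for i, d in enumerate(direction):
            projection[i] += coeff * d
    return projection

def l2_norm(vector):
    return math.sqrt(sum(v * v for v in vector))

def projection_norms(gradient, id_basis):
    """Return (norm of the projection, norm of the orthogonal remainder)."""
    projection = project_onto_subspace(gradient, id_basis)
    residual = [g - p for g, p in zip(gradient, projection)]
    return l2_norm(projection), l2_norm(residual)

# Hypothetical 3-D example: the ID subspace is the x-y plane.
id_basis = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
in_plane = projection_norms([3.0, 4.0, 0.0], id_basis)   # lies in the subspace
off_plane = projection_norms([0.0, 0.0, 2.0], id_basis)  # orthogonal to it
```

Under the abstract's reading, a gradient like the second one (small projection norm, large orthogonal norm) would be the one flagged as OOD.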

CV | Mar 25, 2022
Self-supervised Semantic Segmentation Grounded in Visual Concepts

Wenbin He, William Surmeier, Arvind Kumar Shekar et al.

Unsupervised semantic segmentation requires assigning a label to every pixel without any human annotations. Despite recent advances in self-supervised representation learning for individual images, unsupervised semantic segmentation with pixel-level representations is still a challenging task and remains underexplored. In this work, we propose a self-supervised pixel representation learning method for semantic segmentation by using visual concepts (i.e., groups of pixels with semantic meanings, such as parts, objects, and scenes) extracted from images. To guide self-supervised learning, we leverage three types of relationships between pixels and concepts, including the relationships between pixels and local concepts, local and global concepts, as well as the co-occurrence of concepts. We evaluate the learned pixel embeddings and visual concepts on three datasets, including PASCAL VOC 2012, COCO 2017, and DAVIS 2017. Our results show that the proposed method gains consistent and substantial improvements over recent unsupervised semantic segmentation approaches, and also demonstrate that visual concepts can reveal insights into image datasets.

CV | Jul 20, 2023
UP-DP: Unsupervised Prompt Learning for Data Pre-Selection with Vision-Language Models

Xin Li, Sima Behpour, Thang Doan et al.

In this study, we investigate the task of data pre-selection, which aims to select instances for labeling from an unlabeled dataset through a single pass, thereby optimizing performance for undefined downstream tasks with a limited annotation budget. Previous approaches to data pre-selection relied solely on visual features extracted from foundation models, such as CLIP and BLIP-2, but largely ignored the power of text features. In this work, we argue that, with proper design, the joint feature space of both vision and text can yield a better representation for data pre-selection. To this end, we introduce UP-DP, a simple yet effective unsupervised prompt learning approach that adapts vision-language models, like BLIP-2, for data pre-selection. Specifically, with the BLIP-2 parameters frozen, we train text prompts to extract the joint features with improved representation, ensuring a diverse cluster structure that covers the entire dataset. We extensively compare our method with the state-of-the-art using seven benchmark datasets in different settings, achieving a performance gain of up to 20%. Interestingly, the prompts learned from one dataset demonstrate significant generalizability and can be applied directly to enhance the feature extraction of BLIP-2 from other datasets. To the best of our knowledge, UP-DP is the first work to incorporate unsupervised prompt learning in a vision-language model for data pre-selection.

AI | Nov 6, 2023
InterVLS: Interactive Model Understanding and Improvement with Vision-Language Surrogates

Jinbin Huang, Wenbin He, Liang Gou et al.

Deep learning models are widely used in critical applications, highlighting the need for pre-deployment model understanding and improvement. Visual concept-based methods, while increasingly used for this purpose, face challenges: (1) most concepts lack interpretability, and (2) existing methods require model knowledge, which is often unavailable at run time. Additionally, (3) a no-code method for post-understanding model improvement is lacking. Addressing these, we present InterVLS. The system facilitates model understanding by discovering text-aligned concepts and measuring their influence with model-agnostic linear surrogates. Employing visual analytics, InterVLS offers concept-based explanations and performance insights. It enables users to adjust concept influences to update a model, facilitating no-code model improvement. We evaluate InterVLS in a user study, illustrating its functionality with two scenarios. Results indicate that InterVLS effectively helps users identify concepts influential to a model, gain insights, and adjust concept influence to improve the model. We conclude with a discussion based on our study results.

CL | Nov 28, 2024
MAG-V: A Multi-Agent Framework for Synthetic Data Generation and Verification

Saptarshi Sengupta, Harsh Vashistha, Kristal Curtis et al.

Extending the capabilities of Large Language Models (LLMs) with functions or tools for environment interaction has led to the emergence of the agent paradigm. In industry, training an LLM is not always feasible because of the scarcity of domain data, legal holds on proprietary customer data, rapidly changing business requirements, and the need to prototype new assistants. Agents provide an elegant solution to the above by relying on the zero-shot reasoning abilities of the underlying LLM and utilizing tools to explore and reason over customer data and respond to user requests. However, there are two concerns here: (I) acquiring large scale customer queries for agent testing is time-consuming, and (II) high reliance on the tool call sequence (or trajectory) followed by the agent to respond to user queries may lead to unexpected or incorrect behavior. To address this, we propose MAG-V, a multi-agent framework to first generate a dataset of questions that mimic customer queries; and second, reverse-engineer alternate questions from the responses for trajectory verification. Initial results indicate that our synthetic data can improve agent performance on actual customer queries. Furthermore, our trajectory verification methodology, inspired by distant supervision and using traditional machine learning (ML) models, outperforms a GPT-4o judge baseline by 11% accuracy and matches the performance of a GPT-4 judge on our constructed dataset. Overall, our approach is a step towards unifying diverse task agents into a cohesive framework for achieving an aligned objective.

CV | Jan 12, 2024
AttributionScanner: A Visual Analytics System for Model Validation with Metadata-Free Slice Finding

Xiwei Xuan, Jorge Piazentin Ono, Liang Gou et al.

Data slice finding is an emerging technique for validating machine learning (ML) models by identifying and analyzing subgroups in a dataset that exhibit poor performance, often characterized by distinct feature sets or descriptive metadata. However, in the context of validating vision models involving unstructured image data, this approach faces significant challenges, including the laborious and costly requirement for additional metadata and the complex task of interpreting the root causes of underperformance. To address these challenges, we introduce AttributionScanner, an innovative human-in-the-loop Visual Analytics (VA) system, designed for metadata-free data slice finding. Our system identifies interpretable data slices that involve common model behaviors and visualizes these patterns through an Attribution Mosaic design. Our interactive interface provides straightforward guidance for users to detect, interpret, and annotate predominant model issues, such as spurious correlations (model biases) and mislabeled data, with minimal effort. Additionally, it employs a cutting-edge model regularization technique to mitigate the detected issues and enhance the model's performance. The efficacy of AttributionScanner is demonstrated through use cases involving two benchmark datasets, with qualitative and quantitative evaluations showcasing its substantial effectiveness in vision model validation, ultimately leading to more reliable and accurate models.

CV | Jul 11, 2025
VISTA: A Visual Analytics Framework to Enhance Foundation Model-Generated Data Labels

Xiwei Xuan, Xiaoqi Wang, Wenbin He et al.

The advances in multi-modal foundation models (FMs) (e.g., CLIP and LLaVA) have facilitated the auto-labeling of large-scale datasets, enhancing model performance in challenging downstream tasks such as open-vocabulary object detection and segmentation. However, the quality of FM-generated labels is less studied as existing approaches focus more on data quantity over quality. This is because validating large volumes of data without ground truth presents a considerable challenge in practice. Existing methods typically rely on limited metrics to identify problematic data, lacking a comprehensive perspective, or apply human validation to only a small data fraction, failing to address the full spectrum of potential issues. To overcome these challenges, we introduce VISTA, a visual analytics framework that improves data quality to enhance the performance of multi-modal models. Targeting the complex and demanding domain of open-vocabulary image segmentation, VISTA integrates multi-phased data validation strategies with human expertise, enabling humans to identify, understand, and correct hidden issues within FM-generated labels. Through detailed use cases on two benchmark datasets and expert reviews, we demonstrate VISTA's effectiveness from both quantitative and qualitative perspectives.

CV | May 6, 2025
VISLIX: An XAI Framework for Validating Vision Models with Slice Discovery and Analysis

Xinyuan Yan, Xiwei Xuan, Jorge Piazentin Ono et al.

Real-world machine learning models require rigorous evaluation before deployment, especially in safety-critical domains like autonomous driving and surveillance. The evaluation of machine learning models often focuses on data slices, which are subsets of the data that share a set of characteristics. Data slice finding automatically identifies conditions or data subgroups where models underperform, aiding developers in mitigating performance issues. Despite its popularity and effectiveness, data slicing for vision model validation faces several challenges. First, data slicing often needs additional image metadata or visual concepts, and falls short in certain computer vision tasks, such as object detection. Second, understanding data slices is a labor-intensive and mentally demanding process that heavily relies on the expert's domain knowledge. Third, data slicing lacks a human-in-the-loop solution that allows experts to form hypotheses and test them interactively. To overcome these limitations and better support the machine learning operations lifecycle, we introduce VISLIX, a novel visual analytics framework that employs state-of-the-art foundation models to help domain experts analyze slices in computer vision models. Our approach does not require image metadata or visual concepts, automatically generates natural language insights, and allows users to test data slice hypotheses interactively. We evaluate VISLIX with an expert study and three use cases that demonstrate the effectiveness of our tool in providing comprehensive insights for validating object detection models.

CV | Mar 10, 2024
A streamlined Approach to Multimodal Few-Shot Class Incremental Learning for Fine-Grained Datasets

Thang Doan, Sima Behpour, Xin Li et al.

Few-shot Class-Incremental Learning (FSCIL) poses the challenge of retaining prior knowledge while learning from limited new data streams, all without overfitting. The rise of Vision-Language models (VLMs) has unlocked numerous applications, leveraging their existing knowledge to fine-tune on custom data. However, training the whole model is computationally prohibitive, and VLMs, while versatile in general domains, still struggle with fine-grained datasets crucial for many applications. We tackle these challenges with two simple modules. The first, Session-Specific Prompts (SSP), enhances the separability of image-text embeddings across sessions. The second, Hyperbolic distance, compresses representations of image-text pairs within the same class while expanding those from different classes, leading to better representations. Experimental results demonstrate an average 10-point increase compared to baselines while requiring at least 8 times fewer trainable parameters. This improvement is further underscored on our three newly introduced fine-grained datasets.
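For readers unfamiliar with hyperbolic distance, the following is a minimal sketch of the standard distance in the Poincaré ball model. The abstract does not say which hyperbolic model the authors use, so the choice of the Poincaré ball, the function names, and the toy 2-D points are assumptions made purely for illustration.

```python
import math

def squared_norm(vector):
    return sum(x * x for x in vector)

def poincare_distance(u, v):
    """Poincare-ball distance between two points inside the unit ball.

    d(u, v) = arcosh(1 + 2 * |u - v|^2 / ((1 - |u|^2) * (1 - |v|^2)))
    """
    nu, nv = squared_norm(u), squared_norm(v)
    if nu >= 1.0 or nv >= 1.0:
        raise ValueError("both points must lie strictly inside the unit ball")
    diff = [a - b for a, b in zip(u, v)]
    return math.acosh(1.0 + 2.0 * squared_norm(diff) / ((1.0 - nu) * (1.0 - nv)))

# Same Euclidean gap (0.2), very different hyperbolic distances:
near_origin = poincare_distance([0.1, 0.0], [-0.1, 0.0])
near_boundary = poincare_distance([0.9, 0.1], [0.9, -0.1])
```

The distance grows rapidly as points approach the boundary of the ball, which is the property that lets hyperbolic embeddings keep some pairs close while pushing others far apart.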

LG | Nov 25, 2025
Cisco Time Series Model Technical Report

Liang Gou, Archit Khare, Praneet Pabolu et al.

We introduce the Cisco Time Series Model, a univariate zero-shot forecaster. This time series foundation model is the result of a general architectural innovation to a time series model enabling it to accept multiresolution input, applied to a popular decoder-only time series model (TimesFM). The resulting multiresolution decoder-only model is trained on over 300B unique data points, with more than half coming from the observability domain. Quantitative and qualitative evaluations demonstrate that the resulting model achieves superior performance on observability datasets while retaining very similar performance on a standard general-purpose forecasting benchmark (GIFT-Eval), and suggest that the multiresolution structure enables the model to make more accurate predictions on long context input.

LG | Jun 25, 2024
InFiConD: Interactive No-code Fine-tuning with Concept-based Knowledge Distillation

Jinbin Huang, Wenbin He, Liang Gou et al.

The emergence of large-scale pre-trained models has heightened their application in various downstream tasks, yet deployment is a challenge in environments with limited computational resources. Knowledge distillation has emerged as a solution in such scenarios, whereby knowledge from large teacher models is transferred into smaller student models, but this is a non-trivial process that traditionally requires technical expertise in AI/ML. To address these challenges, this paper presents InFiConD, a novel framework that leverages visual concepts to implement the knowledge distillation process and enable subsequent no-code fine-tuning of student models. We develop a novel knowledge distillation pipeline based on extracting text-aligned visual concepts from a concept corpus using multimodal models, and construct highly interpretable linear student models based on visual concepts that mimic a teacher model in a response-based manner. InFiConD's interface allows users to interactively fine-tune the student model by manipulating concept influences directly in the user interface. We validate InFiConD via a robust usage scenario and user study. Our findings indicate that InFiConD's human-in-the-loop and visualization-driven approach enables users to effectively create and analyze student models, understand how knowledge is transferred, and efficiently perform fine-tuning operations. We discuss how this work highlights the potential of interactive and visual methods in making knowledge distillation and subsequent no-code fine-tuning more accessible and adaptable to a wider range of users with domain-specific demands.

CV | Jun 7, 2024
USE: Universal Segment Embeddings for Open-Vocabulary Image Segmentation

Xiaoqi Wang, Wenbin He, Xiwei Xuan et al.

The open-vocabulary image segmentation task involves partitioning images into semantically meaningful segments and classifying them with flexible text-defined categories. The recent vision-based foundation models such as the Segment Anything Model (SAM) have shown superior performance in generating class-agnostic image segments. The main challenge in open-vocabulary image segmentation now lies in accurately classifying these segments into text-defined categories. In this paper, we introduce the Universal Segment Embedding (USE) framework to address this challenge. This framework is comprised of two key components: 1) a data pipeline designed to efficiently curate a large amount of segment-text pairs at various granularities, and 2) a universal segment embedding model that enables precise segment classification into a vast range of text-defined categories. The USE model can not only help open-vocabulary image segmentation but also facilitate other downstream tasks (e.g., querying and ranking). Through comprehensive experimental studies on semantic segmentation and part segmentation benchmarks, we demonstrate that the USE framework outperforms state-of-the-art open-vocabulary segmentation methods.

CV | May 1, 2023
CLIP-S$^4$: Language-Guided Self-Supervised Semantic Segmentation

Wenbin He, Suphanut Jamonnak, Liang Gou et al.

Existing semantic segmentation approaches are often limited by costly pixel-wise annotations and predefined classes. In this work, we present CLIP-S$^4$, which leverages self-supervised pixel representation learning and vision-language models to enable various semantic segmentation tasks (e.g., unsupervised, transfer learning, language-driven segmentation) without any human annotations or unknown class information. We first learn pixel embeddings with pixel-segment contrastive learning from different augmented views of images. To further improve the pixel embeddings and enable language-driven semantic segmentation, we design two types of consistency guided by vision-language models: 1) embedding consistency, aligning our pixel embeddings to the joint feature space of a pre-trained vision-language model, CLIP; and 2) semantic consistency, forcing our model to make the same predictions as CLIP over a set of carefully designed target classes with both known and unknown prototypes. Thus, CLIP-S$^4$ enables a new task of class-free semantic segmentation where no unknown class information is needed during training. As a result, our approach shows consistent and substantial performance improvement over four popular benchmarks compared with the state-of-the-art unsupervised and language-driven semantic segmentation methods. More importantly, our method outperforms these methods on unknown class recognition by a large margin.

CV | Jan 3, 2022
Novelty-based Generalization Evaluation for Traffic Light Detection

Arvind Kumar Shekar, Laureen Lake, Liang Gou et al.

The advent of Convolutional Neural Networks (CNNs) has led to their application in several domains. One noteworthy application is the perception system for autonomous driving that relies on the predictions from CNNs. Practitioners evaluate the generalization ability of such CNNs by calculating various metrics on an independent test dataset. A test dataset is often chosen based on only one precondition, i.e., its elements are not a part of the training data. Such a dataset may contain objects that are both similar and novel w.r.t. the training dataset. Nevertheless, existing works do not account for the novelty of the test samples and treat them all equally when evaluating generalization. Such novelty-based evaluations are significant for validating the fitness of a CNN in autonomous driving applications. Hence, we propose a CNN generalization scoring framework that considers the novelty of objects in the test dataset. We begin with a representation learning technique to reduce the image data into a low-dimensional space. It is in this space that we estimate the novelty of the test samples. Finally, we calculate the generalization score as a combination of the test data prediction performance and novelty. We perform an experimental study of this framework for our traffic light detection application. In addition, we systematically visualize the results for an interpretable notion of novelty.

HC | Sep 27, 2020
VATLD: A Visual Analytics System to Assess, Understand and Improve Traffic Light Detection

Liang Gou, Lincan Zou, Nanxiang Li et al.

Traffic light detection is crucial for environment perception and decision-making in autonomous driving. State-of-the-art detectors are built upon deep Convolutional Neural Networks (CNNs) and have exhibited promising performance. However, one looming concern with CNN-based detectors is how to thoroughly evaluate their accuracy and robustness before they can be deployed to autonomous vehicles. In this work, we propose a visual analytics system, VATLD, equipped with disentangled representation learning and semantic adversarial learning, to assess, understand, and improve the accuracy and robustness of traffic light detectors in autonomous driving applications. The disentangled representation learning extracts data semantics to augment human cognition with human-friendly visual summarization, and the semantic adversarial learning efficiently exposes interpretable robustness risks and enables minimal human interaction for actionable insights. We also demonstrate the effectiveness of various performance improvement strategies derived from actionable insights with our visual analytics system, VATLD, and illustrate some practical implications for safety-critical applications in autonomous driving.

LG | Sep 23, 2020
Towards a Flexible Embedding Learning Framework

Chin-Chia Michael Yeh, Dhruv Gelda, Zhongfang Zhuang et al.

Representation learning is a fundamental building block for analyzing entities in a database. While the existing embedding learning methods are effective in various data mining problems, their applicability is often limited because these methods have pre-determined assumptions on the type of semantics captured by the learned embeddings, and the assumptions may not well align with specific downstream tasks. In this work, we propose an embedding learning framework that 1) uses an input format that is agnostic to input data type, 2) is flexible in terms of the relationships that can be embedded into the learned representations, and 3) provides an intuitive pathway to incorporate domain knowledge into the embedding learning process. Our proposed framework utilizes a set of entity-relation-matrices as the input, which quantifies the affinities among different entities in the database. Moreover, a sampling mechanism is carefully designed to establish a direct connection between the input and the information captured by the output embeddings. To complete the representation learning toolbox, we also outline a simple yet effective post-processing technique to properly visualize the learned embeddings. Our empirical results demonstrate that the proposed framework, in conjunction with a set of relevant entity-relation-matrices, outperforms the existing state-of-the-art approaches in various data mining tasks.

LG | Dec 22, 2018
Dynamic Graph Representation Learning via Self-Attention Networks

Aravind Sankar, Yanhong Wu, Liang Gou et al.

Learning latent representations of nodes in graphs is an important and ubiquitous task with widespread applications such as link prediction, node classification, and graph visualization. Previous methods on graph representation learning mainly focus on static graphs; however, many real-world graphs are dynamic and evolve over time. In this paper, we present Dynamic Self-Attention Network (DySAT), a novel neural architecture that operates on dynamic graphs and learns node representations that capture both structural properties and temporal evolutionary patterns. Specifically, DySAT computes node representations by jointly employing self-attention layers along two dimensions: structural neighborhood and temporal dynamics. We conduct link prediction experiments on two classes of graphs: communication networks and bipartite rating networks. Our experimental results show that DySAT has a significant performance gain over several different state-of-the-art graph embedding baselines.
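To make the temporal half of this idea concrete, here is a rough sketch of scaled dot-product self-attention applied to one node's embeddings across graph snapshots. It is not the DySAT architecture: there are no learned query/key/value projections, no multiple heads, and no structural attention, and the rule that each time step attends only to itself and earlier steps is an assumption of this sketch. All names are hypothetical.

```python
import math

def softmax(scores):
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def temporal_self_attention(snapshots):
    """Attend over one node's embeddings, ordered oldest first.

    Each output at step t is a softmax-weighted average of the embeddings
    at steps 0..t, weighted by scaled dot-product similarity to step t.
    """
    dim = len(snapshots[0])
    outputs = []
    for t, query in enumerate(snapshots):
        history = snapshots[: t + 1]
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in history
        ]
        weights = softmax(scores)
        outputs.append(
            [sum(w * vec[i] for w, vec in zip(weights, history)) for i in range(dim)]
        )
    return outputs

# Hypothetical node whose embedding changes between two snapshots.
node_history = [[1.0, 0.0], [0.0, 1.0]]
attended = temporal_self_attention(node_history)
```

In a trained model the queries, keys, and values would come from learned linear maps; the sketch keeps them as the raw embeddings so the weighting step stays visible.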

HC | Oct 11, 2017
Raising Awareness of Conveyed Personality In Social Media Traces

Bin Xu, Liang Gou, Anbang Xu et al.

Users' persistent social media content, such as posts on Facebook Timeline, is presented as an "exhibition" about the person to others, and managing this exhibitional content for impression management requires intentional and manual effort. To raise awareness of and facilitate impression management around past content, we developed a prototype called PersonalityInsight. The system employs computational psycho-linguistic analysis to help users visualize the way their past text posts might convey impressions of their personality and allows users to modify their posts based on these visualizations. We conducted a user study to evaluate the design; users overall found that such a tool raised awareness of the fact and the ways personality might be conveyed through their past content as one aspect of impression management, but that it needs design improvement to offer actionable suggestions for content modification, as well as careful thinking about impression management as one of many values people have about their digital past.