Ziming Zhang

CV
h-index: 21
Papers: 65
Citations: 2,173
Novelty: 55%
AI Score: 60

65 Papers

CV · Jul 20, 2022 · Code
Robust Object Detection With Inaccurate Bounding Boxes

Chengxin Liu, Kewei Wang, Hao Lu et al.

Learning accurate object detectors often requires large-scale training data with precise object bounding boxes. However, labeling such data is expensive and time-consuming. As the crowd-sourcing labeling process and the ambiguities of the objects may raise noisy bounding box annotations, the object detectors will suffer from the degenerated training data. In this work, we aim to address the challenge of learning robust object detectors with inaccurate bounding boxes. Inspired by the fact that localization precision suffers significantly from inaccurate bounding boxes while classification accuracy is less affected, we propose leveraging classification as a guidance signal for refining localization results. Specifically, by treating an object as a bag of instances, we introduce an Object-Aware Multiple Instance Learning approach (OA-MIL), featured with object-aware instance selection and object-aware instance extension. The former aims to select accurate instances for training, instead of directly using inaccurate box annotations. The latter focuses on generating high-quality instances for selection. Extensive experiments on synthetic noisy datasets (i.e., noisy PASCAL VOC and MS-COCO) and a real noisy wheat head dataset demonstrate the effectiveness of our OA-MIL. Code is available at https://github.com/cxliu0/OA-MIL.

CV · Feb 5, 2023 · Code
Self-supervised Geometric Features Discovery via Interpretable Attention for Vehicle Re-Identification and Beyond (Complete Version)

Ming Li, Xinming Huang, Ziming Zhang

To learn distinguishable patterns, most recent works in vehicle re-identification (ReID) have struggled to redevelop official benchmarks to provide various forms of supervision, which requires prohibitive human labor. In this paper, we seek to achieve a similar goal without additional human effort. To this end, we introduce a novel framework, which successfully encodes both geometric local features and global representations to distinguish vehicle instances, optimized only by the supervision from official ID labels. Specifically, given our insight that objects in ReID share similar geometric characteristics, we propose to borrow self-supervised representation learning to facilitate geometric feature discovery. To condense these features, we introduce an interpretable attention module, with the core of local maxima aggregation instead of fully automatic learning, whose mechanism is completely understandable and whose response map is physically reasonable. To the best of our knowledge, we are the first to perform self-supervised learning to discover geometric features. We conduct comprehensive experiments on the three most popular datasets for vehicle ReID, i.e., VeRi-776, CityFlow-ReID, and VehicleID. We report our state-of-the-art (SOTA) performance and promising visualization results. We also show the excellent scalability of our approach on other ReID related tasks, i.e., person ReID and multi-target multi-camera (MTMC) vehicle tracking. The code is available at https://github.com/ming1993li/Self-supervised-Geometric.

94.3 · CV · May 30 · Code
CV-Arena: An Open Benchmark for Instructional Computer Vision Problem Solving with Human-AI Collaborative Preferences

Fangzhou Lin, Peiran Li, Lingyu Xu et al.

Instruction-guided image editing is becoming a general interface for visual work, yet existing benchmarks still focus largely on narrow appearance edits and do not fully capture the diversity of real-image tasks in professional workflows. Here, we define instructional computer vision problem solving as a broader formulation of image editing: given a real input image and a natural-language instruction, a system must produce an edited output that realizes the requested transformation while satisfying explicit preservation, geometric, physical, and usability constraints. We introduce CV-Arena, an open benchmark designed to evaluate this capability at professional scales. CV-Arena contains 12K high-resolution real-image instruction pairs spanning 16 instruction-based visual task types, constructed using CogRetriever, a dual-track retrieval-and-curation pipeline that combines targeted web search, agentic query refinement, verification, and traceability. To evaluate models at scale while preserving human fidelity, we propose Active Elo, a human-AI collaborative preference protocol that leverages CV-Judge, a logic-gated, multi-dimensional VLM evaluator, to reject clear failures and resolve high-confidence comparisons; and to route close, high-quality comparisons to expert raters. Mixed human and AI supervision is then aggregated through reliability-weighted Elo updates. Our comprehensive evaluation of 21 systems, including proprietary, open-source, and agentic models, on CV-Arena reveals persistent gaps in instruction adherence, physical reasoning, structural control, and fine-grained detail preservation. We further develop CV-Agent, a lightweight agentic model that combines planning, editing, and verification, and demonstrate that closed-loop reasoning is a promising direction for professional-grade instruction-following visual editing.
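The abstract above mentions reliability-weighted Elo updates but gives no formula. As a rough sketch only, the snippet below uses the standard Elo expected-score rule and scales the usual update by a per-judgment weight. The `reliability` parameter, the K-factor of 32, and the function names are assumptions made for illustration and are not taken from the paper.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Standard Elo expected score of system A against system B."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))


def weighted_elo_update(rating_a, rating_b, outcome_a, reliability, k=32.0):
    """Return the updated (rating_a, rating_b) after one comparison.

    outcome_a: 1.0 if A wins, 0.0 if B wins, 0.5 for a tie.
    reliability: hypothetical weight in [0, 1] describing how much the
        judge (human rater or VLM) is trusted; it scales the step size.
    """
    delta = k * reliability * (outcome_a - expected_score(rating_a, rating_b))
    return rating_a + delta, rating_b - delta


# Two equally rated systems, A wins. A fully trusted judgment moves the
# ratings twice as far as a half-trusted one.
print(weighted_elo_update(1500.0, 1500.0, 1.0, reliability=1.0))
print(weighted_elo_update(1500.0, 1500.0, 1.0, reliability=0.5))
```

Scaling the step by a weight keeps the update zero-sum, so a low-confidence judgment nudges the ranking rather than overturning it.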

CV · Mar 29, 2022 · Code
Robust Structured Declarative Classifiers for 3D Point Clouds: Defending Adversarial Attacks with Implicit Gradients

Kaidong Li, Ziming Zhang, Cuncong Zhong et al.

Deep neural networks for 3D point cloud classification, such as PointNet, have been demonstrated to be vulnerable to adversarial attacks. Current adversarial defenders often learn to denoise the (attacked) point clouds by reconstruction, and then feed them to the classifiers as input. In contrast to the literature, we propose a family of robust structured declarative classifiers for point cloud classification, where the internal constrained optimization mechanism can effectively defend adversarial attacks through implicit gradients. Such classifiers can be formulated using a bilevel optimization framework. We further propose an effective and efficient instantiation of our approach, namely, Lattice Point Classifier (LPC), based on structured sparse coding in the permutohedral lattice and 2D convolutional neural networks (CNNs) that is end-to-end trainable. We demonstrate state-of-the-art robust point cloud classification performance on ModelNet40 and ScanNet under seven different attackers. For instance, we achieve 89.51% and 83.16% test accuracy on each dataset under the recent JGBA attacker that outperforms DUP-Net and IF-Defense with PointNet by ~70%. Demo code is available at https://zhang-vislab.github.io.

99.7 · CL · Apr 5 · Code
AdaptFuse: Training-Free Sequential Preference Learning via Externalized Bayesian Inference

Fangzhou Lin, Peiran Li, Shuo Xing et al.

Large language models struggle to accumulate evidence across multiple rounds of user interaction, failing to update their beliefs in a manner consistent with Bayesian inference. Existing solutions require fine-tuning on sensitive user interaction data, limiting their applicability in privacy-conscious settings. We propose AdaptFuse, a training-free framework that externalizes probabilistic computation entirely from the LLM: a symbolic module maintains a Bayesian posterior over a discrete hypothesis set, while a frozen LLM contributes semantic reasoning via multi-sample Dirichlet aggregation. The two signals are combined through entropy-adaptive fusion, which automatically weights each source by its predictive confidence, shifting reliance from the LLM to the symbolic posterior as evidence accumulates. We evaluate across three domains: flight recommendation, hotel recommendation, and web shopping; on Gemma 2 9B, Llama 3 8B, and Qwen 2.5 7B. AdaptFuse consistently outperforms both prompting baselines and fine-tuned Bayesian Teaching models on all tasks, with accuracy improving monotonically over interaction rounds. These results demonstrate that principled inference-time algorithms can substitute for fine-tuning in personalized recommendation, without storing or training on sensitive user data. All the code and materials will be open-sourced.
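To make the two ingredients in the abstract above concrete, here is a minimal sketch of a Bayesian posterior over a discrete hypothesis set combined with a second distribution through an entropy-based weight. The weighting rule (confidence taken as maximum entropy minus entropy), the function names, and the toy numbers are all assumptions made for illustration. The abstract does not state the actual fusion rule, and the multi-sample Dirichlet aggregation step is not modelled here.

```python
import math


def bayes_update(prior, likelihood):
    """Posterior over discrete hypotheses: prior times likelihood, renormalized."""
    unnorm = [p * l for p, l in zip(prior, likelihood)]
    total = sum(unnorm)
    return [u / total for u in unnorm]


def entropy(dist):
    """Shannon entropy in nats, skipping zero-probability entries."""
    return -sum(p * math.log(p) for p in dist if p > 0.0)


def fusion_weight(symbolic, llm):
    """Weight on the symbolic posterior (assumed rule, not from the paper).

    Each source's confidence is its entropy gap to the uniform distribution.
    """
    h_max = math.log(len(symbolic))
    c_sym = max(0.0, h_max - entropy(symbolic))
    c_llm = max(0.0, h_max - entropy(llm))
    total = c_sym + c_llm
    return 0.5 if total < 1e-12 else c_sym / total


def entropy_adaptive_fusion(symbolic, llm):
    """Convex combination of the two distributions using fusion_weight."""
    w = fusion_weight(symbolic, llm)
    return [w * s + (1.0 - w) * q for s, q in zip(symbolic, llm)]


# Toy run: three hypotheses, the same evidence observed in each round.
posterior = [1 / 3, 1 / 3, 1 / 3]
llm_belief = [0.5, 0.3, 0.2]   # stand-in for the frozen LLM's distribution
evidence = [0.8, 0.1, 0.1]     # likelihood of the observation under each hypothesis
for round_idx in range(3):
    posterior = bayes_update(posterior, evidence)
    fused = entropy_adaptive_fusion(posterior, llm_belief)
    print(round_idx, round(fusion_weight(posterior, llm_belief), 3),
          [round(p, 3) for p in fused])
```

In this toy run the weight on the symbolic posterior grows as the posterior sharpens, which mirrors the shift in reliance described in the abstract.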

CV · Sep 10, 2024 · Code
Loss Distillation via Gradient Matching for Point Cloud Completion with Weighted Chamfer Distance

Fangzhou Lin, Haotian Liu, Haoying Zhou et al.

3D point clouds enhance a robot's ability to perceive the geometric information of its environment, enabling many downstream tasks such as grasp pose detection and scene understanding. The performance of these tasks, though, relies heavily on the quality of the input data, as incomplete data can lead to poor results and failure cases. Recent training loss functions designed for deep learning-based point cloud completion, such as Chamfer distance (CD) and its variants (e.g., HyperCD), imply that a good gradient weighting scheme can significantly boost performance. However, these CD-based loss functions usually require data-related parameter tuning, which can be time-consuming for data-extensive tasks. To address this issue, we aim to find a family of weighted training losses (weighted CD) that requires no parameter tuning. To this end, we propose a search scheme, Loss Distillation via Gradient Matching, to find good candidate loss functions by mimicking the learning behavior in backpropagation between HyperCD and weighted CD. Once this is done, we propose a novel bilevel optimization formula to train the backbone network based on the weighted CD loss. We observe that: (1) with proper weighted functions, the weighted CD can always achieve similar performance to HyperCD, and (2) the Landau weighted CD, namely Landau CD, can outperform HyperCD for point cloud completion and lead to new state-of-the-art results on several benchmark datasets. Our demo code is available at https://github.com/Zhang-VISLab/IROS2024-LossDistillationWeightedCD.

CV · Feb 2, 2023
Hyperbolic Contrastive Learning

Yun Yue, Fangzhou Lin, Kazunori D Yamada et al.

Learning good image representations that are beneficial to downstream tasks is a challenging task in computer vision. As such, a wide variety of self-supervised learning approaches have been proposed. Among them, contrastive learning has shown competitive performance on several benchmark datasets. The embeddings of contrastive learning are arranged on a hypersphere that results in using the inner (dot) product as a distance measurement in Euclidean space. However, the underlying structure of many scientific fields like social networks, brain imaging, and computer graphics data exhibit highly non-Euclidean latent geometry. We propose a novel contrastive learning framework to learn semantic relationships in the hyperbolic space. Hyperbolic space is a continuous version of trees that naturally owns the ability to model hierarchical structures and is thus beneficial for efficient contrastive representation learning. We also extend the proposed Hyperbolic Contrastive Learning (HCL) to the supervised domain and studied the adversarial robustness of HCL. The comprehensive experiments show that our proposed method achieves better results on self-supervised pretraining, supervised classification, and higher robust accuracy than baseline methods.
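For readers unfamiliar with hyperbolic embeddings, the sketch below computes the geodesic distance in the Poincaré ball, one common model of hyperbolic space. The abstract does not say which model the method uses, so the choice of the Poincaré ball here is an assumption, and `poincare_distance` is a name introduced only for illustration.

```python
import math


def poincare_distance(u, v):
    """Geodesic distance between two points strictly inside the unit ball.

    d(u, v) = arcosh(1 + 2 * |u - v|^2 / ((1 - |u|^2) * (1 - |v|^2)))
    """
    def sq_norm(x):
        return sum(c * c for c in x)

    diff = sq_norm([a - b for a, b in zip(u, v)])
    denom = (1.0 - sq_norm(u)) * (1.0 - sq_norm(v))
    return math.acosh(1.0 + 2.0 * diff / denom)


# The same Euclidean gap of 0.1 is a much larger hyperbolic distance near
# the boundary of the ball than near the origin.
near_origin = poincare_distance([0.0, 0.0], [0.1, 0.0])
near_boundary = poincare_distance([0.85, 0.0], [0.95, 0.0])
print(near_origin, near_boundary)
```

Distances grow rapidly toward the boundary, which is why tree-like hierarchies, whose node count grows exponentially with depth, can be embedded with low distortion.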

RO · Nov 29, 2022
Simultaneous Estimation of Hand Configurations and Finger Joint Angles using Forearm Ultrasound

Keshav Bimbraw, Christopher J. Nycz, Matt Schueler et al.

With the advancement in computing and robotics, it is necessary to develop fluent and intuitive methods for interacting with digital systems, augmented/virtual reality (AR/VR) interfaces, and physical robotic systems. Hand motion recognition is widely used to enable these interactions. Hand configuration classification and MCP joint angle detection is important for a comprehensive reconstruction of hand motion. sEMG and other technologies have been used for the detection of hand motions. Forearm ultrasound images provide a musculoskeletal visualization that can be used to understand hand motion. Recent work has shown that these ultrasound images can be classified using machine learning to estimate discrete hand configurations. Estimating both hand configuration and MCP joint angles based on forearm ultrasound has not been addressed in the literature. In this paper, we propose a CNN based deep learning pipeline for predicting the MCP joint angles. The results for the hand configuration classification were compared by using different machine learning algorithms. SVC with different kernels, MLP, and the proposed CNN have been used to classify the ultrasound images into 11 hand configurations based on activities of daily living. Forearm ultrasound images were acquired from 6 subjects instructed to move their hands according to predefined hand configurations. Motion capture data was acquired to get the finger angles corresponding to the hand movements at different speeds. Average classification accuracy of 82.7% for the proposed CNN and over 80% for SVC for different kernels was observed on a subset of the dataset. An average RMSE of 7.35 degrees was obtained between the predicted and the true MCP joint angles. A low latency (6.25 - 9.1 Hz) pipeline has been proposed for estimating both MCP joint angles and hand configuration aimed at real-time control of human-machine interfaces.

LG · Jun 22, 2022
Auto-Encoding Adversarial Imitation Learning

Kaifeng Zhang, Rui Zhao, Ziming Zhang et al.

Reinforcement learning (RL) provides a powerful framework for decision-making, but its application in practice often requires a carefully designed reward function. Adversarial Imitation Learning (AIL) sheds light on automatic policy acquisition without access to the reward signal from the environment. In this work, we propose Auto-Encoding Adversarial Imitation Learning (AEAIL), a robust and scalable AIL framework. To induce expert policies from demonstrations, AEAIL utilizes the reconstruction error of an auto-encoder as a reward signal, which provides more information for optimizing policies than the prior discriminator-based ones. Subsequently, we use the derived objective functions to train the auto-encoder and the agent policy. Experiments show that AEAIL outperforms state-of-the-art methods in both state-based and image-based environments. More importantly, AEAIL shows much better robustness when the expert demonstrations are noisy.

CV · Mar 21, 2023
PRISE: Demystifying Deep Lucas-Kanade with Strongly Star-Convex Constraints for Multimodel Image Alignment

Yiqing Zhang, Xinming Huang, Ziming Zhang

The Lucas-Kanade (LK) method is a classic iterative homography estimation algorithm for image alignment, but often suffers from poor local optimality especially when image pairs have large distortions. To address this challenge, in this paper we propose a novel Deep Star-Convexified Lucas-Kanade (PRISE) method for multimodel image alignment by introducing strongly star-convex constraints into the optimization problem. Our basic idea is to enforce the neural network to approximately learn a star-convex loss landscape around the ground truth given any data to facilitate the convergence of the LK method to the ground truth through the high dimensional space defined by the network. This leads to a minimax learning problem, with contrastive (hinge) losses due to the definition of strong star-convexity that are appended to the original loss for training. We also provide an efficient sampling based algorithm to leverage the training cost, as well as some analysis on the quality of the solutions from PRISE. We further evaluate our approach on benchmark datasets such as MSCOCO, GoogleEarth, and GoogleMap, and demonstrate state-of-the-art results, especially for small pixel errors. Code can be downloaded from https://github.com/Zhang-VISLab.

CV · Sep 27, 2022
EgoSpeed-Net: Forecasting Speed-Control in Driver Behavior from Egocentric Video Data

Yichen Ding, Ziming Zhang, Yanhua Li et al.

Speed-control forecasting, a challenging problem in driver behavior analysis, aims to predict the future actions of a driver in controlling vehicle speed such as braking or acceleration. In this paper, we try to address this challenge solely using egocentric video data, in contrast to the majority of works in the literature using either third-person view data or extra vehicle sensor data such as GPS, or both. To this end, we propose a novel graph convolutional network (GCN) based network, namely, EgoSpeed-Net. We are motivated by the fact that the position changes of objects over time can provide us very useful clues for forecasting the speed change in future. We first model the spatial relations among the objects from each class, frame by frame, using fully-connected graphs, on top of which GCNs are applied for feature extraction. Then we utilize a long short-term memory network to fuse such features per class over time into a vector, concatenate such vectors and forecast a speed-control action using a multilayer perceptron classifier. We conduct extensive experiments on the Honda Research Institute Driving Dataset and demonstrate the superior performance of EgoSpeed-Net.

52.4 · CV · Mar 13
NexusFlow: Unifying Disparate Tasks under Partial Supervision via Invertible Flow Networks

Fangzhou Lin, Yuping Wang, Yuliang Guo et al.

Partially Supervised Multi-Task Learning (PS-MTL) aims to leverage knowledge across tasks when annotations are incomplete. Existing approaches, however, have largely focused on the simpler setting of homogeneous, dense prediction tasks, leaving the more realistic challenge of learning from structurally diverse tasks unexplored. To this end, we introduce NexusFlow, a novel, lightweight, and plug-and-play framework effective in both settings. NexusFlow introduces a set of surrogate networks with invertible coupling layers to align the latent feature distributions of tasks, creating a unified representation that enables effective knowledge transfer. The coupling layers are bijective, preserving information while mapping features into a shared canonical space. This invertibility avoids representational collapse and enables alignment across structurally different tasks without reducing expressive capacity. We first evaluate NexusFlow on the core challenge of domain-partitioned autonomous driving, where dense map reconstruction and sparse multi-object tracking are supervised in different geographic regions, creating both structural disparity and a strong domain gap. NexusFlow sets a new state-of-the-art result on nuScenes, outperforming strong partially supervised baselines. To demonstrate generality, we further test NexusFlow on NYUv2 using three homogeneous dense prediction tasks, segmentation, depth, and surface normals, as a representative N-task PS-MTL scenario. NexusFlow yields consistent gains across all tasks, confirming its broad applicability.

CV · Apr 23, 2024 · Code
Understanding Hyperbolic Metric Learning through Hard Negative Sampling

Yun Yue, Fangzhou Lin, Guanyi Mou et al.

In recent years, there has been a growing trend of incorporating hyperbolic geometry methods into computer vision. While these methods have achieved state-of-the-art performance on various metric learning tasks using hyperbolic distance measurements, the underlying theoretical analysis supporting this superior performance remains under-exploited. In this study, we investigate the effects of integrating hyperbolic space into metric learning, particularly when training with contrastive loss. We identify a need for a comprehensive comparison between Euclidean and hyperbolic spaces regarding the temperature effect in the contrastive loss within the existing literature. To address this gap, we conduct an extensive investigation to benchmark the results of Vision Transformers (ViTs) using a hybrid objective function that combines loss from Euclidean and hyperbolic spaces. Additionally, we provide a theoretical analysis of the observed performance improvement. We also reveal that hyperbolic metric learning is highly related to hard negative sampling, providing insights for future work. This work will provide valuable data points and experience in understanding hyperbolic image embeddings. To shed more light on problem-solving and encourage further investigation into our approach, our code is available online (https://github.com/YunYunY/HypMix).

88.2 · AI · May 15
CAPS: Cascaded Adaptive Pairwise Selection for Efficient Parallel Reasoning

Fangzhou Lin, Shuo Xing, Peiran Li et al.

Parallel reasoning, where a generator samples many candidate solutions and an aggregator selects the best, is one of the most effective forms of test-time scaling in large language models, and pairwise self-verification has become its strongest aggregation primitive. Yet pairwise verification carries a heavy cost: each judgment reads two complete solutions in full, and existing methods perform tens of such judgments per problem regardless of whether the comparison is informative. We introduce CAPS (Cascaded Adaptive Pairwise Selection), an inference-only framework that allocates verifier compute non-uniformly along two orthogonal axes: an evidence axis that adapts how much of each candidate the judge sees, and a distribution axis that adapts how comparisons are spread across the pool. CAPS instantiates these into a four-stage cascade with an optional rescue subroutine, and admits a closed-form verifier-token cost in which the per-candidate marginal cost is roughly halved relative to uniform full-evidence schedules. On four self-verifying models (Qwen3-14B, GPT-OSS-20B, Qwen3-4B-Instruct/Thinking) and five reasoning benchmarks spanning code (LiveCodeBench-v5/v6, CodeContests) and math (AIME 2025, HMMT 2025), CAPS outperforms the leading pairwise verifier on 14 of 20 suites while using 25.4% of its verifier-token budget on code, and outperforms pointwise self-verification on all 20. The trade-off suites admit an interpretable diagnostic in terms of the verifier's accuracy at partial versus full evidence, providing a concrete pre-deployment check for cascade suitability.

84.8 · CR · Apr 28
R-CoT: A Reasoning-Layer Watermark via Redundant Chain-of-Thought in Large Language Models

Ziming Zhang, Li Li, Guorui Feng et al.

Large language models (LLMs) are widely deployed in multiple scenarios due to reasoning capabilities. In order to prevent the models from being misused, watermarking is generally employed to ensure ownership. However, most existing watermarking methods rely on superficial modifications to the model's output distribution, rendering the watermark vulnerable to perturbation and removal. To overcome this challenge, this paper introduces a reasoning-layer framework termed Redundant Chain-of-Thought (R-CoT), which embeds watermarks into the reasoning path. A dual-trajectory optimization mechanism based on GRPO enables the native and the watermark reasoning path to coexist within a shared parameter space, internalizing the watermark as a distinct reasoning policy. Therefore, the watermark is embedded into the model's stable reasoning path, avoiding the watermark failure caused by output-level perturbations. Experimental results show that, compared with existing methods, R-CoT achieves high watermark effectiveness and strong robustness. Under fine-tuning and other post-training operations, the true positive rate (TPR) consistently remains above 95%, exhibiting only marginal degradation.

CV · Oct 13, 2024 · Code
Robust 3D Point Clouds Classification based on Declarative Defenders

Kaidong Li, Tianxiao Zhang, Cuncong Zhong et al.

3D point cloud classification requires distinct models from 2D image classification due to the divergent characteristics of the respective input data. While 3D point clouds are unstructured and sparse, 2D images are structured and dense. Bridging the domain gap between these two data types is a non-trivial challenge to enable model interchangeability. Recent research using Lattice Point Classifier (LPC) highlights the feasibility of cross-domain applicability. However, the lattice projection operation in LPC generates 2D images with disconnected projected pixels. In this paper, we explore three distinct algorithms for mapping 3D point clouds into 2D images. Through extensive experiments, we thoroughly examine and analyze their performance and defense mechanisms. Leveraging current large foundation models, we scrutinize the feature disparities between regular 2D images and projected 2D images. The proposed approaches demonstrate superior accuracy and robustness against adversarial attacks. The generative model-based mapping algorithms yield regular 2D images, further minimizing the domain gap from regular 2D classification tasks. The source code is available at https://github.com/KaidongLi/pytorch-LatticePointClassifier.git.

SI · Sep 25, 2024
Wildlife Product Trading in Online Social Networks: A Case Study on Ivory-Related Product Sales Promotion Posts

Guanyi Mou, Yun Yue, Kyumin Lee et al.

Wildlife trafficking (WLT) has emerged as a global issue, with traffickers expanding their operations from offline to online platforms, utilizing e-commerce websites and social networks to enhance their illicit trade. This paper addresses the challenge of detecting and recognizing wildlife product sales promotion behaviors in online social networks, a crucial aspect in combating these environmentally harmful activities. To counter these environmentally damaging illegal operations, in this research, we focus on wildlife product sales promotion behaviors in online social networks. Specifically, 1) A scalable dataset related to wildlife product trading is collected using a network-based approach. This dataset is labeled through a human-in-the-loop machine learning process, distinguishing positive class samples containing wildlife product selling posts and hard-negatives representing normal posts misclassified as potential WLT posts, subsequently corrected by human annotators. 2) We benchmark the machine learning results on the proposed dataset and build a practical framework that automatically identifies suspicious wildlife selling posts and accounts, sufficiently leveraging the multi-modal nature of online social networks. 3) This research delves into an in-depth analysis of trading posts, shedding light on the systematic and organized selling behaviors prevalent in the current landscape. We provide detailed insights into the nature of these behaviors, contributing valuable information for understanding and countering illegal wildlife product trading.

LG · Aug 3, 2025 · Code
KANMixer: Can KAN Serve as a New Modeling Core for Long-term Time Series Forecasting?

Lingyu Jiang, Yuping Wang, Yao Su et al.

In recent years, multilayer perceptrons (MLP)-based deep learning models have demonstrated remarkable success in long-term time series forecasting (LTSF). Existing approaches typically augment MLP backbones with hand-crafted external modules to address the inherent limitations of their flat architectures. Despite their success, these augmented methods neglect hierarchical locality and sequential inductive biases essential for time-series modeling, and recent studies indicate diminishing performance improvements. To overcome these limitations, we explore Kolmogorov-Arnold Networks (KAN), a recently proposed model featuring adaptive basis functions capable of granular, local modulation of nonlinearities. This raises a fundamental question: Can KAN serve as a new modeling core for LTSF? To answer this, we introduce KANMixer, a concise architecture integrating a multi-scale mixing backbone that fully leverages KAN's adaptive capabilities. Extensive evaluation demonstrates that KANMixer achieves state-of-the-art performance in 16 out of 28 experiments across seven benchmark datasets. To uncover the reasons behind this strong performance, we systematically analyze the strengths and limitations of KANMixer in comparison with traditional MLP architectures. Our findings reveal that the adaptive flexibility of KAN's learnable basis functions significantly transforms the influence of network structural prior on forecasting performance. Furthermore, we identify critical design factors affecting forecasting accuracy and offer practical insights for effectively utilizing KAN in LTSF. Together, these insights constitute the first empirically grounded guidelines for effectively leveraging KAN in LTSF. Code is available in the supplementary file.

CV · Apr 5, 2021 · Code
Training Deep Neural Networks via Branch-and-Bound

Yuanwei Wu, Ziming Zhang, Guanghui Wang

In this paper, we propose BPGrad, a novel approximate algorithm for deep neural network training, based on adaptive estimates of the feasible region via branch-and-bound. The method is based on the assumption of Lipschitz continuity in the objective function, and as a result, it can adaptively determine the step size for the current gradient given the history of previous updates. We prove that, by repeating such a branch-and-pruning procedure, it can achieve the optimal solution within finite iterations. A computationally efficient solver based on BPGrad has been proposed to train the deep neural networks. Empirical results demonstrate that the BPGrad solver works well in practice and compares favorably to other stochastic optimization methods in the tasks of object recognition, detection, and segmentation. The code is available at https://github.com/RyanCV/BPGrad.

CV · Mar 12, 2020 · Code
Learning to Segment 3D Point Clouds in 2D Image Space

Yecheng Lyu, Xinming Huang, Ziming Zhang

In contrast to the literature where local patterns in 3D point clouds are captured by customized convolutional operators, in this paper we study the problem of how to effectively and efficiently project such point clouds into a 2D image space so that traditional 2D convolutional neural networks (CNNs) such as U-Net can be applied for segmentation. To this end, we are motivated by graph drawing and reformulate it as an integer programming problem to learn the topology-preserving graph-to-grid mapping for each individual point cloud. To accelerate the computation in practice, we further propose a novel hierarchical approximate algorithm. With the help of the Delaunay triangulation for graph construction from point clouds and a multi-scale U-Net for segmentation, we manage to demonstrate the state-of-the-art performance on ShapeNet and PartNet, respectively, with significant improvement over the literature. Code is available at https://github.com/Zhang-VISLab.

CV · Dec 23, 2024
Hyperbolic Chamfer Distance for Point Cloud Completion and Beyond

Fangzhou Lin, Songlin Hou, Haotian Liu et al.

Chamfer Distance (CD) is widely used as a metric to quantify the difference between two point clouds. In point cloud completion, CD is typically used as a loss function in deep learning frameworks. However, it is generally acknowledged within the field that CD is vulnerable to the presence of outliers, which can consequently lead to convergence on suboptimal models. In divergence from the existing literature, which largely concentrates on resolving such concerns in the realm of Euclidean space, we put forth a notably uncomplicated yet potent metric specifically designed for point cloud completion tasks: Hyperbolic Chamfer Distance (HyperCD). This metric conducts Chamfer Distance computations within the parameters of hyperbolic space. During the backpropagation process, HyperCD systematically allocates greater weight to matched point pairs exhibiting reduced Euclidean distances. This mechanism facilitates the preservation of accurate point pair matches while permitting the incremental adjustment of suboptimal matches, thereby contributing to enhanced point cloud completion outcomes. Moreover, since measuring shape dissimilarity is useful beyond the point cloud completion task, we further explore its applications in other generative related tasks, including single image reconstruction from point cloud, and upsampling. We demonstrate state-of-the-art performance on the point cloud completion benchmark datasets, PCN, ShapeNet-55, and ShapeNet-34, and show from visualization that HyperCD can significantly improve the surface smoothness. We also provide experimental results beyond the completion task.
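As a rough illustration of the idea in the abstract above, the sketch below computes a plain symmetric Chamfer distance and a variant that passes each nearest-neighbour squared distance through arcosh(1 + alpha * d). That specific form, the `alpha` parameter, and all function names are assumptions made for illustration, since the abstract gives no formula. The variant is chosen because its derivative with respect to d shrinks as d grows, which matches the described behaviour of giving closer matches more gradient weight.

```python
import math


def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def chamfer(xs, ys, transform=lambda d: d):
    """Symmetric Chamfer distance between two point sets.

    For each point, take the squared distance to its nearest neighbour in
    the other set, apply `transform`, and average. Both directions are summed.
    """
    def one_way(src, dst):
        return sum(transform(min(sq_dist(p, q) for q in dst)) for p in src) / len(src)

    return one_way(xs, ys) + one_way(ys, xs)


def hyperbolic_chamfer(xs, ys, alpha=1.0):
    """Chamfer distance with an arcosh transform (assumed form, see lead-in)."""
    return chamfer(xs, ys, transform=lambda d: math.acosh(1.0 + alpha * d))


# A single outlier dominates the plain distance far more than the
# transformed one.
clean = [(0.0, 0.0), (1.0, 0.0)]
with_outlier = [(0.0, 0.0), (10.0, 0.0)]
print(chamfer(clean, with_outlier), hyperbolic_chamfer(clean, with_outlier))
```

This brute-force nearest-neighbour search is quadratic in the number of points and is meant only to show the structure of the loss, not to be used at training scale.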

CVJun 18, 2025
A Strong View-Free Baseline Approach for Single-View Image Guided Point Cloud Completion

Fangzhou Lin, Zilin Dai, Rigved Sanku et al.

The single-view image guided point cloud completion (SVIPC) task aims to reconstruct a complete point cloud from a partial input with the help of a single-view image. While previous works have demonstrated the effectiveness of this multimodal approach, the fundamental necessity of image guidance remains largely unexamined. To explore this, we propose a strong baseline approach for SVIPC based on an attention-based multi-branch encoder-decoder network that is view-free, taking only partial point clouds as input. Our hierarchical self-fusion mechanism, driven by cross-attention and self-attention layers, effectively integrates information across multiple streams, enriching feature representations and strengthening the network's ability to capture geometric structures. Extensive experiments and ablation studies on the ShapeNet-ViPC dataset demonstrate that our view-free framework outperforms state-of-the-art SVIPC methods. We hope our findings provide new insights into the development of multimodal learning in SVIPC. Our demo code will be available at https://github.com/Zhang-VISLab.

CLMar 21, 2025
A Language Anchor-Guided Method for Robust Noisy Domain Generalization

Zilin Dai, Lehong Wang, Fangzhou Lin et al.

Real-world machine learning applications often struggle with two major challenges: distribution shift and label noise. Models tend to overfit by focusing on redundant and uninformative features in the training data, which makes it hard for them to generalize to the target domain. Noisy data worsens this problem by causing further overfitting to the noise, meaning that existing methods often fail to tell the difference between true, invariant features and misleading, spurious ones. To tackle these issues, we introduce Anchor Alignment and Adaptive Weighting (A3W). This new algorithm uses sample reweighting guided by natural language processing (NLP) anchors to extract more representative features. In simple terms, A3W leverages semantic representations from natural language models as a source of domain-invariant prior knowledge. Additionally, it employs a weighted loss function that adjusts each sample's contribution based on its similarity to the corresponding NLP anchor. This adjustment makes the model more robust to noisy labels. Extensive experiments on standard benchmark datasets show that A3W consistently outperforms state-of-the-art domain generalization methods, offering significant improvements in both accuracy and robustness across different datasets and noise levels.
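
The weighted loss described in this abstract can be sketched roughly as follows. This is only an illustration under stated assumptions: the abstract does not specify how similarity is turned into a weight, so the cosine similarity, the linear mapping to `[0, 1]`, and all function names here are hypothetical choices, not the A3W implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0


def anchor_weights(features, labels, anchors):
    """ASSUMPTION: map cosine similarity in [-1, 1] linearly to a weight in [0, 1].

    `anchors` maps each class label to a (hypothetical) language-model embedding.
    """
    return [(1.0 + cosine(f, anchors[y])) / 2.0 for f, y in zip(features, labels)]


def weighted_cross_entropy(probs, labels, weights):
    """Cross-entropy where each sample's term is scaled by its weight."""
    total = sum(
        w * -math.log(max(p[y], 1e-12))
        for p, y, w in zip(probs, labels, weights)
    )
    norm = sum(weights)
    return total / norm if norm > 0 else 0.0


if __name__ == "__main__":
    anchors = {0: [1.0, 0.0], 1: [0.0, 1.0]}
    features = [[1.0, 0.0], [1.0, 0.0]]
    labels = [0, 1]  # the second label disagrees with its feature, as a noisy label might
    probs = [[0.9, 0.1], [0.9, 0.1]]
    w = anchor_weights(features, labels, anchors)
    print(w, weighted_cross_entropy(probs, labels, w))
```

A sample whose feature lies far from the anchor of its assigned class gets a smaller weight, so a likely mislabeled sample contributes less to the loss.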

CVDec 16, 2024
SPADE: Spectroscopic Photoacoustic Denoising using an Analytical and Data-free Enhancement Framework

Fangzhou Lin, Shang Gao, Yichuan Tang et al.

Spectroscopic photoacoustic (sPA) imaging uses multiple wavelengths to differentiate chromophores based on their unique optical absorption spectra. This technique has been widely applied in areas such as vascular mapping, tumor detection, and therapeutic monitoring. However, sPA imaging is highly susceptible to noise, leading to poor signal-to-noise ratio (SNR) and compromised image quality. Traditional denoising techniques like frame averaging, though effective in improving SNR, can be impractical for dynamic imaging scenarios due to reduced frame rates. Advanced methods, including learning-based approaches and analytical algorithms, have demonstrated promise but often require extensive training data and parameter tuning, limiting their adaptability for real-time clinical use. In this work, we propose SPADE, a tuning-free analytical and data-free enhancement framework for denoising sPA images. This framework integrates a data-free learning-based method with an efficient BM3D-based analytical approach while preserving spectral linearity, providing noise reduction and ensuring that functional information is maintained. The SPADE framework was validated through simulation, phantom, ex vivo, and in vivo experiments. Results demonstrated that SPADE improved SNR and preserved spectral information, outperforming conventional methods, especially in challenging imaging conditions. SPADE presents a promising solution for enhancing sPA imaging quality in clinical applications where noise reduction and spectral preservation are critical.

CVNov 16, 2024
Deep Loss Convexification for Learning Iterative Models

Ziming Zhang, Yuping Shao, Yiqing Zhang et al.

Iterative methods such as iterative closest point (ICP) for point cloud registration often suffer from bad local optimality (e.g., saddle points), due to the nature of nonconvex optimization. To address this fundamental challenge, in this paper we propose learning to form the loss landscape of a deep iterative method w.r.t. predictions at test time into a convex-like shape locally around each ground truth given data, namely Deep Loss Convexification (DLC), thanks to the overparametrization in neural networks. To this end, we formulate our learning objective based on adversarial training by manipulating the ground-truth predictions, rather than input data. In particular, we propose using star-convexity, a family of structured nonconvex functions that are unimodal on all lines that pass through a global minimizer, as our geometric constraint for reshaping loss landscapes, leading to (1) extra novel hinge losses appended to the original loss and (2) near-optimal predictions. We demonstrate the state-of-the-art performance using DLC with existing network architectures for the tasks of training recurrent neural networks (RNNs), 3D point cloud registration, and multimodal image alignment.

LGNov 23, 2025
TimePre: Bridging Accuracy, Efficiency, and Stability in Probabilistic Time-Series Forecasting

Lingyu Jiang, Lingyu Xu, Peiran Li et al.

Probabilistic Time-Series Forecasting (PTSF) is critical for uncertainty-aware decision making, but existing generative models, such as diffusion-based approaches, are computationally prohibitive due to expensive iterative sampling. Non-sampling frameworks like Multiple Choice Learning (MCL) offer an efficient alternative, but suffer from severe training instability and hypothesis collapse, which has historically hindered their performance. This problem is dramatically exacerbated when attempting to combine them with modern, efficient MLP-based backbones. To resolve this fundamental incompatibility, we propose TimePre, a novel framework that successfully unifies the efficiency of MLP-based models with the distributional flexibility of the MCL paradigm. The core of our solution is Stabilized Instance Normalization (SIN), a novel normalization layer that explicitly remedies this incompatibility. SIN stabilizes the hybrid architecture by correcting channel-wise statistical shifts, definitively resolving the catastrophic hypothesis collapse. Extensive experiments on six benchmark datasets demonstrate that TimePre achieves new state-of-the-art accuracy on key probabilistic metrics. Critically, TimePre achieves inference speeds orders of magnitude faster than sampling-based models and, unlike prior MCL work, demonstrates stable performance scaling. It thus bridges the long-standing gap between accuracy, efficiency, and stability in probabilistic forecasting.
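
For readers unfamiliar with Multiple Choice Learning, the sketch below shows the usual winner-takes-all loss over several hypotheses, together with a simple count of which hypothesis wins. It is a generic illustration, not the TimePre architecture and not the SIN layer, whose details the abstract does not give; the function names are hypothetical.

```python
def wta_loss(hypotheses, target):
    """Winner-takes-all loss: only the hypothesis closest to the target is penalized.

    Returns (mean squared error of the winner, index of the winner).
    """
    errors = [
        sum((h_t - y_t) ** 2 for h_t, y_t in zip(h, target)) / len(target)
        for h in hypotheses
    ]
    winner = min(range(len(errors)), key=errors.__getitem__)
    return errors[winner], winner


def winner_counts(batch_hypotheses, targets):
    """Count how often each hypothesis head wins across a batch.

    A heavily skewed count (one head winning almost always) is one
    symptom of hypothesis collapse.
    """
    counts = [0] * len(batch_hypotheses[0])
    for hyps, y in zip(batch_hypotheses, targets):
        _, k = wta_loss(hyps, y)
        counts[k] += 1
    return counts


if __name__ == "__main__":
    hyps = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]  # three candidate forecasts
    print(wta_loss(hyps, [1.0, 1.2]))
    print(winner_counts([hyps, hyps], [[1.0, 1.2], [0.9, 1.0]]))
```

Because gradients flow only through the winning hypothesis, heads that never win receive no training signal, which is one way the collapse mentioned in the abstract can arise.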

CVOct 19, 2021
CoFi: Coarse-to-Fine ICP for LiDAR Localization in an Efficient Long-lasting Point Cloud Map

Yecheng Lyu, Xinming Huang, Ziming Zhang

LiDAR odometry and localization have attracted increasing research interest in recent years. In existing works, iterative closest point (ICP) is widely used since it is precise and efficient. Due to its non-convexity and its local iterative strategy, however, ICP-based methods easily fall into local optima, which in turn calls for a precise initialization. In this paper, we propose CoFi, a Coarse-to-Fine ICP algorithm for LiDAR localization. Specifically, the proposed algorithm down-samples the input point sets under multiple voxel resolutions, and gradually refines the transformation from the coarse point sets to the fine-grained point sets. In addition, we propose a map-based LiDAR localization algorithm that extracts semantic feature points from the LiDAR frames and applies CoFi to estimate the pose on an efficient point cloud map. With the help of the Cylinder3D algorithm for LiDAR scan semantic segmentation, the proposed CoFi localization algorithm demonstrates the state-of-the-art performance on the KITTI odometry benchmark, with significant improvement over the literature.
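
The multi-resolution down-sampling step mentioned in this abstract can be sketched as below. This covers only the voxel pyramid, not the ICP refinement itself; the centroid-per-voxel rule, the voxel sizes, and the function names are assumptions for illustration.

```python
import math
from collections import defaultdict


def voxel_downsample(points, voxel_size):
    """Replace all points that fall into the same voxel by their centroid."""
    buckets = defaultdict(list)
    for p in points:
        key = tuple(math.floor(c / voxel_size) for c in p)
        buckets[key].append(p)
    return [
        tuple(sum(coord) / len(pts) for coord in zip(*pts))
        for pts in buckets.values()
    ]


def coarse_to_fine_pyramid(points, voxel_sizes=(2.0, 1.0, 0.5)):
    """Return down-sampled point sets ordered from coarsest to finest.

    The voxel sizes are hypothetical; a registration loop would estimate a
    transform on the coarsest level first and reuse it to initialize the next.
    """
    return [voxel_downsample(points, s) for s in sorted(voxel_sizes, reverse=True)]


if __name__ == "__main__":
    scan = [(0.1, 0.1, 0.1), (0.6, 0.6, 0.6), (1.6, 0.1, 0.1), (3.2, 0.1, 0.1)]
    for level in coarse_to_fine_pyramid(scan):
        print(len(level), level)
```

Starting from the coarsest level gives the alignment a smoother problem with fewer points, which is the usual motivation for coarse-to-fine schemes when the initial pose is imprecise.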

LGOct 14, 2021
Don't Knock! Rowhammer at the Backdoor of DNN Models

M. Caner Tol, Saad Islam, Andrew J. Adiletta et al.

State-of-the-art deep neural networks (DNNs) have been proven to be vulnerable to adversarial manipulation and backdoor attacks. Backdoored models deviate from expected behavior on inputs with predefined triggers while retaining performance on clean data. Recent works focus on software simulation of backdoor injection during the inference phase by modifying network weights, which we find often unrealistic in practice due to restrictions in hardware. In contrast, in this work for the first time, we present an end-to-end backdoor injection attack realized on actual hardware on a classifier model using Rowhammer as the fault injection method. To this end, we first investigate the viability of backdoor injection attacks in real-life deployments of DNNs on hardware and address such practical issues in hardware implementation from a novel optimization perspective. We are motivated by the fact that vulnerable memory locations are very rare, device-specific, and sparsely distributed. Consequently, we propose a novel network training algorithm based on constrained optimization to achieve a realistic backdoor injection attack in hardware. By modifying parameters uniformly across the convolutional and fully-connected layers as well as optimizing the trigger pattern together, we achieve state-of-the-art attack performance with fewer bit flips. For instance, our method on a hardware-deployed ResNet-20 model trained on CIFAR-10 achieves over 89% test accuracy and 92% attack success rate by flipping only 10 out of 2.2 million bits.

ROSep 23, 2021
Prediction of Metacarpophalangeal joint angles and Classification of Hand configurations based on Ultrasound Imaging of the Forearm

Keshav Bimbraw, Christopher Julius Nycz, Matt Schueler et al.

With the advancement in computing and robotics, it is necessary to develop fluent and intuitive methods for interacting with digital systems, AR/VR interfaces, and physical robotic systems. Hand movement recognition is widely used to enable this interaction. Hand configuration classification and Metacarpophalangeal (MCP) joint angle detection are important for a comprehensive reconstruction of the hand motion. Surface electromyography and other technologies have been used for the detection of hand motions. Ultrasound images of the forearm offer a way to visualize the internal physiology of the hand from a musculoskeletal perspective. Recent work has shown that these images can be classified using machine learning to predict various hand configurations. In this paper, we propose a Convolutional Neural Network (CNN) based deep learning pipeline for predicting the MCP joint angles. We supplement our results by using a Support Vector Classifier (SVC) to classify the ultrasound information into several predefined hand configurations based on activities of daily living (ADL). Ultrasound data from the forearm was obtained from 6 subjects who were instructed to move their hands according to predefined hand configurations relevant to ADLs. Motion capture data was acquired as the ground truth for hand movements at different speeds (0.5 Hz, 1 Hz, & 2 Hz) for the index, middle, ring, and pinky fingers. We were able to get promising SVC classification results on a subset of our collected data set. We demonstrated a correspondence between the predicted MCP joint angles and the actual MCP joint angles for the fingers, with an average root mean square error of 7.35 degrees. We implemented a low latency (6.25 - 9.1 Hz) pipeline for the prediction of both MCP joint angles and hand configuration estimation aimed at real-time control of digital devices, AR/VR interfaces, and physical robots.

ROSep 13, 2021
LiDAR Odometry Methodologies for Autonomous Driving: A Survey

Nikhil Jonnavithula, Yecheng Lyu, Ziming Zhang

Vehicle odometry is an essential component of an automated driving system as it computes the vehicle's position and orientation. The odometry module has a higher demand and impact in urban areas where the global navigation satellite system (GNSS) signal is weak and noisy. Traditional visual odometry methods suffer from the diverse illumination status and get disparities during pose estimation, which results in significant errors as the error accumulates. Odometry using light detection and ranging (LiDAR) devices has attracted increasing research interest as LiDAR devices are robust to illumination variations. In this survey, we examine the existing LiDAR odometry methods and summarize the pipeline and delineate the several intermediate steps. Additionally, the existing LiDAR odometry methods are categorized by their correspondence type, and their advantages, disadvantages, and correlations are analyzed across-category and within-category in each step. Finally, we compare the accuracy and the running speed among these methodologies evaluated over the KITTI odometry dataset and outline promising future research directions.

CVMay 23, 2021
Revisiting 2D Convolutional Neural Networks for Graph-based Applications

Yecheng Lyu, Xinming Huang, Ziming Zhang

Graph convolutional networks (GCNs) are widely used in graph-based applications such as graph classification and segmentation. However, current GCNs have limitations on implementation such as network architectures due to their irregular inputs. In contrast, convolutional neural networks (CNNs) are capable of extracting rich features from large-scale input data, but they do not support general graph inputs. To bridge the gap between GCNs and CNNs, in this paper we study the problem of how to effectively and efficiently map general graphs to 2D grids that CNNs can be directly applied to, while preserving graph topology as much as possible. We therefore propose two novel graph-to-grid mapping schemes, namely, graph-preserving grid layout (GPGL) and its extension Hierarchical GPGL (H-GPGL) for computational efficiency. We formulate the GPGL problem as integer programming and further propose an approximate yet efficient solver based on a penalized Kamada-Kawai method, a well-known optimization algorithm in 2D graph drawing. We propose a novel vertex separation penalty that encourages graph vertices to lie on the grid without any overlap. Along with this image representation, even extra 2D maxpooling layers contribute to the PointNet, a widely applied point-based neural network. We demonstrate the empirical success of GPGL on general graph classification with small graphs and H-GPGL on 3D point cloud segmentation with large graphs, based on 2D CNNs including VGG16, ResNet50 and multi-scale maxout (MSM) CNN.

CVApr 22, 2021
Deep Lucas-Kanade Homography for Multimodal Image Alignment

Yiming Zhao, Xinming Huang, Ziming Zhang

Estimating homography to align image pairs captured by different sensors or image pairs with large appearance changes is an important and general challenge for many computer vision applications. In contrast to others, we propose a generic solution to pixel-wise align multimodal image pairs by extending the traditional Lucas-Kanade algorithm with networks. The key contribution in our method is how we construct feature maps, named as deep Lucas-Kanade feature map (DLKFM). The learned DLKFM can spontaneously recognize invariant features under various appearance-changing conditions. It also has two nice properties for the Lucas-Kanade algorithm: (1) The template feature map keeps brightness consistency with the input feature map, thus the color difference is very small while they are well-aligned. (2) The Lucas-Kanade objective function built on DLKFM has a smooth landscape around ground truth homography parameters, so the iterative solution of the Lucas-Kanade can easily converge to the ground truth. With those properties, directly updating the Lucas-Kanade algorithm on our feature maps will precisely align image pairs with large appearance changes. We share the datasets, code, and demo video online.

CVApr 17, 2021
A Surface Geometry Model for LiDAR Depth Completion

Yiming Zhao, Lin Bai, Ziming Zhang et al.

LiDAR depth completion is a task that predicts depth values for every pixel on the corresponding camera frame, although only sparse LiDAR points are available. Most of the existing state-of-the-art solutions are based on deep neural networks, which need a large amount of data and heavy computations for training the models. In this letter, a novel non-learning depth completion method is proposed by exploiting the local surface geometry that is enhanced by an outlier removal algorithm. The proposed surface geometry model is inspired by the observation that most pixels with unknown depth have a nearby LiDAR point. Therefore, it is assumed those pixels share the same surface with the nearest LiDAR point, and their respective depth can be estimated as the nearest LiDAR depth value plus a residual error. The residual error is calculated by using a derived equation with several physical parameters as input, including the known camera intrinsic parameters, estimated normal vector, and offset distance on the image plane. The proposed method is further enhanced by an outlier removal algorithm that is designed to remove incorrectly mapped LiDAR points from occluded regions. On the KITTI dataset, the proposed solution achieves the best error performance among all existing non-learning methods and is comparable to the best self-supervised learning method and some supervised learning methods. Moreover, since outlier points from occluded regions are a common problem, the proposed outlier removal algorithm is a general preprocessing step that is applicable to many robotic systems with both camera and LiDAR sensors.

CVMar 3, 2021
EllipsoidNet: Ellipsoid Representation for Point Cloud Classification and Segmentation

Yecheng Lyu, Xinming Huang, Ziming Zhang

Point cloud patterns are hard to learn because of the implicit local geometry features among the orderless points. In recent years, point cloud representation in 2D space has attracted increasing research interest since it exposes the local geometry features in a 2D space. By projecting those points to a 2D feature map, the relationship between points is inherited in the context between pixels, which are further extracted by a 2D convolutional neural network. However, existing 2D representing methods are either accuracy limited or time-consuming. In this paper, we propose a novel 2D representation method that projects a point cloud onto an ellipsoid surface space, where local patterns are well exposed in ellipsoid-level and point-level. Additionally, a novel convolutional neural network named EllipsoidNet is proposed to utilize those features for point cloud classification and segmentation applications. The proposed methods are evaluated in ModelNet40 and ShapeNet benchmarks, where the advantages are clearly shown over existing 2D representation methods.

CVOct 19, 2020
Self-supervised Geometric Features Discovery via Interpretable Attention for Vehicle Re-Identification and Beyond

Ming Li, Xinming Huang, Ziming Zhang

To learn distinguishable patterns, most recent works in vehicle re-identification (ReID) have struggled to redevelop official benchmarks to provide various forms of supervision, which requires prohibitive human labor. In this paper, we seek to achieve a similar goal without involving additional human effort. To this end, we introduce a novel framework, which successfully encodes both geometric local features and global representations to distinguish vehicle instances, optimized only by the supervision from official ID labels. Specifically, given our insight that objects in ReID share similar geometric characteristics, we propose to borrow self-supervised representation learning to facilitate geometric features discovery. To condense these features, we introduce an interpretable attention module, with the core of local maxima aggregation instead of fully automatic learning, whose mechanism is completely understandable and whose response map is physically reasonable. To the best of our knowledge, we are the first to perform self-supervised learning to discover geometric features. We conduct comprehensive experiments on the three most popular datasets for vehicle ReID, i.e., VeRi-776, CityFlow-ReID, and VehicleID. We report state-of-the-art (SOTA) performance and promising visualization results. We also show the excellent scalability of our approach on other ReID-related tasks, i.e., person ReID and multi-target multi-camera (MTMC) vehicle tracking.

LGOct 12, 2020
RNN Training along Locally Optimal Trajectories via Frank-Wolfe Algorithm

Yun Yue, Ming Li, Venkatesh Saligrama et al.

We propose a novel and efficient training method for RNNs by iteratively seeking a local minimum on the loss surface within a small region, and leveraging this directional vector for the update in an outer loop. We propose to utilize the Frank-Wolfe (FW) algorithm in this context. Although FW implicitly involves normalized gradients, which can lead to a slow convergence rate, we develop a novel RNN training method whose overall training cost, surprisingly, even with the additional cost, is empirically observed to be lower than that of back-propagation. Our method leads to a new Frank-Wolfe method that is in essence an SGD algorithm with a restart scheme. We prove that under certain conditions our algorithm has a sublinear convergence rate of $O(1/ε)$ for $ε$ error. We then conduct empirical experiments on several benchmark datasets including those that exhibit long-term dependencies, and show significant performance improvement. We also experiment with deep RNN architectures and show efficient training performance. Finally, we demonstrate that our training method is robust to noisy data.

LGOct 2, 2020
$f$-GAIL: Learning $f$-Divergence for Generative Adversarial Imitation Learning

Xin Zhang, Yanhua Li, Ziming Zhang et al.

Imitation learning (IL) aims to learn a policy from expert demonstrations that minimizes the discrepancy between the learner and expert behaviors. Various imitation learning algorithms have been proposed with different pre-determined divergences to quantify the discrepancy. This naturally gives rise to the following question: Given a set of expert demonstrations, which divergence can recover the expert policy more accurately with higher data efficiency? In this work, we propose $f$-GAIL, a new generative adversarial imitation learning (GAIL) model, that automatically learns a discrepancy measure from the $f$-divergence family as well as a policy capable of producing expert-like behaviors. Compared with IL baselines with various predefined divergence measures, $f$-GAIL learns better policies with higher data efficiency in six physics-based control tasks.

CVSep 1, 2020
LodoNet: A Deep Neural Network with 2D Keypoint Matching for 3D LiDAR Odometry Estimation

Ce Zheng, Yecheng Lyu, Ming Li et al.

Deep learning based LiDAR odometry (LO) estimation attracts increasing research interest in the field of autonomous driving and robotics. Existing works feed consecutive LiDAR frames into neural networks as point clouds and match pairs in the learned feature space. In contrast, motivated by the success of image based feature extractors, we propose to transfer the LiDAR frames to image space and reformulate the problem as image feature extraction. With the help of scale-invariant feature transform (SIFT) for feature extraction, we are able to generate matched keypoint pairs (MKPs) that can be precisely returned to the 3D space. A convolutional neural network pipeline is designed for LiDAR odometry estimation by extracted MKPs. The proposed scheme, namely LodoNet, is then evaluated in the KITTI odometry estimation benchmark, achieving results on par with or even better than the state of the art.

CVJun 21, 2020
TreeRNN: Topology-Preserving Deep Graph Embedding and Learning

Yecheng Lyu, Ming Li, Xinming Huang et al.

General graphs are difficult for learning due to their irregular structures. Existing works employ message passing along graph edges to extract local patterns using customized graph kernels, but few of them are effective for the integration of such local patterns into global features. In contrast, in this paper we study the methods to transfer the graphs into trees so that explicit orders are learned to direct the feature integration from local to global. To this end, we apply the breadth first search (BFS) to construct trees from the graphs, which adds direction to the graph edges from the center node to the peripheral nodes. In addition, we propose a novel projection scheme that transfers the trees to image representations, which are suitable for conventional convolutional neural networks (CNNs) and recurrent neural networks (RNNs). To best learn the patterns from the graph-tree-images, we propose TreeRNN, a 2D RNN architecture that recurrently integrates the image pixels by rows and columns to help classify the graph categories. We evaluate the proposed method on several graph classification datasets, and manage to demonstrate comparable accuracy with the state-of-the-art on MUTAG, PTC-MR and NCI1 datasets.
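
The BFS tree construction described in this abstract can be sketched as follows. The abstract does not say how the center node is chosen, so picking the node of minimum eccentricity is an assumption here, and the function names are hypothetical. The sketch assumes a connected, undirected graph given as an adjacency dictionary.

```python
from collections import deque


def bfs_tree(adjacency, root):
    """Return ({node: parent}, {node: depth}) for the BFS tree rooted at `root`.

    Each tree edge points from a parent (closer to the root) to a child,
    which gives the undirected graph edges a center-to-periphery direction.
    """
    parent = {root: None}
    depth = {root: 0}
    queue = deque([root])
    while queue:
        u = queue.popleft()
        for v in adjacency[u]:
            if v not in parent:
                parent[v] = u
                depth[v] = depth[u] + 1
                queue.append(v)
    return parent, depth


def graph_center(adjacency):
    """ASSUMPTION: use the node whose farthest node is nearest (minimum eccentricity)."""
    return min(adjacency, key=lambda n: max(bfs_tree(adjacency, n)[1].values()))


if __name__ == "__main__":
    # A path graph 0 - 1 - 2 - 3 - 4.
    graph = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
    root = graph_center(graph)
    print(root, bfs_tree(graph, root))
```

The resulting depths give a natural ordering of nodes by level, which is the kind of explicit order a row-and-column image layout could be built on.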

CVJun 1, 2020
Automatic Building and Labeling of HD Maps with Deep Learning

Mahdi Elhousni, Yecheng Lyu, Ziming Zhang et al.

In a world where autonomous driving cars are becoming increasingly common, creating an adequate infrastructure for this new technology is essential. This includes building and labeling high-definition (HD) maps accurately and efficiently. Today, the process of creating HD maps requires a lot of human input, which takes time and is prone to errors. In this paper, we propose a novel method capable of generating labelled HD maps from raw sensor data. We implemented and tested our methods on several urban scenarios using data collected from our test vehicle. The results show that the proposed deep learning based method can produce highly accurate HD maps. This approach speeds up the process of building and labeling HD maps, which can make a meaningful contribution to the deployment of autonomous vehicles.

CVJan 5, 2020
Self-Orthogonality Module: A Network Architecture Plug-in for Learning Orthogonal Filters

Ziming Zhang, Wenchi Ma, Yuanwei Wu et al.

In this paper, we investigate the empirical impact of orthogonality regularization (OR) in deep learning, either solo or collaboratively. Recent works on OR showed some promising results on the accuracy. In our ablation study, however, we do not observe such significant improvement from existing OR techniques compared with the conventional training based on weight decay, dropout, and batch normalization. To identify the real gain from OR, inspired by the locality sensitive hashing (LSH) in angle estimation, we propose to introduce an implicit self-regularization into OR to push the mean and variance of filter angles in a network towards 90 and 0 simultaneously to achieve (near) orthogonality among the filters, without using any other explicit regularization. Our regularization can be implemented as an architectural plug-in and integrated with an arbitrary network. We reveal that OR helps stabilize the training process and leads to faster convergence and better generalization.
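
The quantity this abstract targets, the mean and variance of pairwise filter angles, can be computed as in the sketch below. It only measures the statistics; how the plug-in drives them towards 90 and 0 is not described in the abstract and is not reproduced here. Filters are assumed to be flattened, non-zero vectors, and the function names are hypothetical.

```python
import math
from itertools import combinations


def pairwise_filter_angles(filters):
    """Angles in degrees between every pair of flattened, non-zero filters."""
    angles = []
    for u, v in combinations(filters, 2):
        dot = sum(a * b for a, b in zip(u, v))
        norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
        cos = max(-1.0, min(1.0, dot / norm))  # clamp against rounding error
        angles.append(math.degrees(math.acos(cos)))
    return angles


def angle_statistics(filters):
    """Return (mean, variance) of the pairwise angles.

    A set of mutually orthogonal filters gives mean 90 and variance 0.
    """
    angles = pairwise_filter_angles(filters)
    mean = sum(angles) / len(angles)
    var = sum((a - mean) ** 2 for a in angles) / len(angles)
    return mean, var


if __name__ == "__main__":
    orthogonal = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    correlated = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
    print(angle_statistics(orthogonal))
    print(angle_statistics(correlated))
```

Tracking these two numbers during training is a cheap way to check how close a layer's filters are to being orthogonal.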

LGSep 26, 2019
Graph-Preserving Grid Layout: A Simple Graph Drawing Method for Graph Classification using CNNs

Yecheng Lyu, Xinming Huang, Ziming Zhang

Graph convolutional networks (GCNs) suffer from the irregularity of graphs, while more widely-used convolutional neural networks (CNNs) benefit from regular grids. To bridge the gap between GCN and CNN, in contrast to previous works on generalizing the basic operations in CNNs to graph data, in this paper we address the problem of how to project undirected graphs onto the grid in a principled way where CNNs can be used as backbone for geometric deep learning. To this end, inspired by the literature of graph drawing we propose a novel graph-preserving grid layout (GPGL), an integer program that minimizes the topological loss on the grid. Technically we propose solving GPGL approximately using a regularized Kamada-Kawai algorithm, a well-known nonconvex optimization technique in graph drawing, with a vertex separation penalty that improves the rounding performance on top of the solutions from relaxation. Using GPGL we can easily conduct data augmentation as every local minimum will lead to a grid layout for the same graph. Together with the help of multi-scale maxout CNNs, we demonstrate the empirical success of our method for graph classification.

LGSep 13, 2019
White-Box Adversarial Defense via Self-Supervised Data Estimation

Zudi Lin, Hanspeter Pfister, Ziming Zhang

In this paper, we study the problem of how to defend classifiers against adversarial attacks that fool the classifiers using subtly modified input data. In contrast to previous works, here we focus on the white-box adversarial defense where the attackers are granted full access to not only the classifiers but also defenders to produce as strong attacks as possible. In such a context we propose viewing a defender as a functional, a higher-order function that takes functions as its argument to represent a function space, rather than fixed functions conventionally. From this perspective, a defender should be realized and optimized individually for each adversarial input. To this end, we propose RIDE, an efficient and provably convergent self-supervised learning algorithm for individual data estimation to protect the predictions from adversarial attacks. We demonstrate the significant improvement of adversarial defense performance on image recognition, e.g., 98%, 76%, 43% test accuracy on MNIST, CIFAR-10, and ImageNet datasets respectively under the state-of-the-art BPDA attacker.

CVAug 31, 2019
Towards Learning Affine-Invariant Representations via Data-Efficient CNNs

Xenju Xu, Guanghui Wang, Alan Sullivan et al.

In this paper we propose integrating a priori knowledge into both design and training of convolutional neural networks (CNNs) to learn object representations that are invariant to affine transformations (i.e., translation, scale, rotation). Accordingly we propose a novel multi-scale maxout CNN and train it end-to-end with a novel rotation-invariant regularizer. This regularizer aims to enforce the weights in each 2D spatial filter to approximate circular patterns. In this way, we manage to handle affine transformations in training using convolution, multi-scale maxout, and circular filters. Empirically we demonstrate that such knowledge can significantly improve the data-efficiency as well as generalization and robustness of learned models. For instance, on the Traffic Sign data set and trained with only 10 images per class, our method can achieve 84.15% that outperforms the state-of-the-art by 29.80% in terms of test accuracy.

CVAug 27, 2019
Unsupervised Deep Feature Transfer for Low Resolution Image Classification

Yuanwei Wu, Ziming Zhang, Guanghui Wang

In this paper, we propose a simple yet effective unsupervised deep feature transfer algorithm for low resolution image classification. No fine-tuning on convnet filters is required in our method. We use a pre-trained convnet to extract features for both high- and low-resolution images, and then feed them into a two-layer feature transfer network for knowledge transfer. An SVM classifier is learned directly using these transferred low resolution features. Our network can be embedded into the state-of-the-art deep neural networks as a plug-in feature enhancement module. It preserves data structures in feature space for high resolution images, and transfers the distinguishing features from a well-structured source domain (high resolution features space) to a not well-organized target domain (low resolution features space). Extensive experiments on VOC2007 test set show that the proposed method achieves significant improvements over the baseline of using feature extraction.

LG Aug 22, 2019
RNNs Evolving on an Equilibrium Manifold: A Panacea for Vanishing and Exploding Gradients?

Anil Kag, Ziming Zhang, Venkatesh Saligrama

Recurrent neural networks (RNNs) are particularly well-suited for modeling long-term dependencies in sequential data, but are notoriously hard to train because the error backpropagated in time either vanishes or explodes at an exponential rate. While a number of works attempt to mitigate this effect through gated recurrent units, well-chosen parametric constraints, and skip-connections, we develop a novel perspective that seeks to evolve the hidden state on the equilibrium manifold of an ordinary differential equation (ODE). We propose a family of novel RNNs, namely Equilibriated Recurrent Neural Networks (ERNNs), that overcome the gradient decay or explosion effect and lead to recurrent models that evolve on the equilibrium manifold. We show that equilibrium points are stable, leading to fast convergence of the discretized ODE to fixed points. Furthermore, ERNNs account for long-term dependencies, and can efficiently recall informative aspects of data from the distant past. We show that ERNNs achieve state-of-the-art accuracy on many challenging data sets with 3-10x speedups, 1.5-3x model size reduction, and with similar prediction cost relative to vanilla RNNs.

CV Mar 26, 2019
Verification of Very Low-Resolution Faces Using An Identity-Preserving Deep Face Super-Resolution Network

Esra Ataer-Cansizoglu, Michael Jones, Ziming Zhang et al.

Face super-resolution methods usually aim at producing visually appealing results rather than preserving distinctive features for further face identification. In this work, we propose a deep learning method for face verification on very low-resolution face images that involves identity-preserving face super-resolution. Our framework includes a super-resolution network and a feature extraction network. We train a VGG-based deep face recognition network (Parkhi et al. 2015) to serve as the feature extractor. Our super-resolution network is trained to minimize the feature distance between the high-resolution ground-truth image and the super-resolved image, where features are extracted using our pre-trained feature extraction network. We carry out experiments on the FRGC, Multi-PIE, LFW-a, and MegaFace datasets to evaluate our method in controlled and uncontrolled settings. The results show that the presented method outperforms conventional super-resolution methods in low-resolution face verification.

LG Mar 2, 2019
Time-Delay Momentum: A Regularization Perspective on the Convergence and Generalization of Stochastic Momentum for Deep Learning

Ziming Zhang, Wenju Xu, Alan Sullivan

In this paper, we study the convergence and the generalization error bound of stochastic momentum for deep learning from the perspective of regularization. To do so, we first interpret momentum as solving an $\ell_2$-regularized minimization problem to learn the offsets between any two successive model parameters. We call this time-delay momentum because the model parameter is updated after a few iterations towards finding the minimizer. We then propose our learning algorithm, i.e., stochastic gradient descent (SGD) with time-delay momentum. We show that our algorithm can be interpreted as solving a sequence of strongly convex optimization problems using SGD. We prove that under mild conditions our algorithm converges to a stationary point at a rate of $O(\frac{1}{\sqrt{K}})$, with a generalization error bound of $O(\frac{1}{\sqrt{n\delta}})$ with probability at least $1-\delta$, where $K$ and $n$ are the numbers of model updates and training samples, respectively. We demonstrate the empirical superiority of our algorithm in deep learning in comparison with state-of-the-art deep learning solvers.
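To make the regularization view of momentum concrete, the following worked identity shows how the classical heavy-ball update can be written as the minimizer of an $\ell_2$-regularized problem over the offset between successive parameters. This is a standard textbook identity offered as a sketch; the symbols $\theta_k$, $v_k$, $g_k$, $\eta$, and $\mu$ are assumptions introduced here, and the paper's own time-delay formulation may differ.

```latex
% Assumed notation: \theta_k parameters, v_k = \theta_k - \theta_{k-1} offset,
% g_k stochastic gradient at \theta_k, \eta > 0 step size, \mu \in [0,1) momentum.
\begin{aligned}
v_{k+1} &= \arg\min_{v} \; \langle g_k, v \rangle
           + \frac{1}{2\eta} \, \lVert v - \mu v_k \rVert_2^2 \\
        &\;\Longrightarrow\; g_k + \frac{1}{\eta}\,(v_{k+1} - \mu v_k) = 0
         \;\Longrightarrow\; v_{k+1} = \mu v_k - \eta g_k, \\
\theta_{k+1} &= \theta_k + v_{k+1}.
\end{aligned}
```

The objective is strongly convex in $v$ with modulus $1/\eta$, so each step has a unique minimizer, which is the heavy-ball velocity update.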

LG Mar 2, 2019
Equilibrated Recurrent Neural Network: Neuronal Time-Delayed Self-Feedback Improves Accuracy and Stability

Ziming Zhang, Anil Kag, Alan Sullivan et al.

We propose a novel Equilibrated Recurrent Neural Network (ERNN) to combat the issues of inaccuracy and instability in conventional RNNs. Drawing upon the concept of autapse in neuroscience, we propose augmenting an RNN with a time-delayed self-feedback loop. Our sole purpose is to modify the dynamics of each internal RNN state and, at any time, force it to evolve close to the equilibrium point associated with the input signal at that time. We show that such self-feedback helps stabilize the hidden state transitions, leading to fast convergence during training while efficiently learning discriminative latent features that yield state-of-the-art results on several benchmark datasets at test time. We propose a novel inexact Newton method to solve the fixed-point conditions, given model parameters, for generating the latent features at each hidden state. We prove that our inexact Newton method converges locally at a linear rate (under mild conditions). We leverage this result for efficient training of ERNNs based on backpropagation.
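The idea of driving a hidden state to the equilibrium point associated with the current input can be sketched with a scalar toy model. The example below assumes a fixed-point condition of the form h = tanh(w·h + u·x) and solves it by plain fixed-point iteration, which is a deliberately simpler stand-in for the inexact Newton method described in the abstract. The function name `equilibrium_state`, the scalar weights `w` and `u`, and the form of the condition are all hypothetical.

```python
import math


def equilibrium_state(x, w=0.5, u=1.0, tol=1e-10, max_iter=200):
    """Solve h = tanh(w*h + u*x) for a scalar hidden state h.

    Uses plain fixed-point iteration (not the paper's inexact Newton method).
    With |w| < 1 the map is a contraction, so the iteration converges to the
    unique equilibrium for any input x.
    """
    h = 0.0
    for _ in range(max_iter):
        h_next = math.tanh(w * h + u * x)
        if abs(h_next - h) < tol:
            return h_next
        h = h_next
    return h  # best estimate if the tolerance was not reached


# The returned state should satisfy the fixed-point condition almost exactly.
h_star = equilibrium_state(0.3)
residual = abs(h_star - math.tanh(0.5 * h_star + 0.3))
print(h_star, residual)
```

Because the slope of tanh never exceeds 1, the error shrinks by at least a factor of |w| per step in this toy setting. A Newton-type solver, as the abstract proposes, would typically need far fewer iterations at the cost of using derivative information.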

ML May 22, 2018
Deformable Part Networks

Ziming Zhang, Rongmei Lin, Alan Sullivan

In this paper, we propose novel Deformable Part Networks (DPNs) to learn pose-invariant representations for 2D object recognition. In contrast to state-of-the-art pose-aware networks such as CapsNet (Sabour et al. 2017) and STN (Jaderberg et al. 2015), DPNs can be naturally interpreted as an efficient solver for a challenging detection problem, namely Localized Deformable Part Models (LDPMs), where localization is introduced to DPMs as another latent variable for searching for the best poses of objects over all pixels and (predefined) scales. In particular, we construct DPNs as sequences of such LDPM units to model the semantic and spatial relations among the deformable parts as hierarchical composition and spatial parsing trees. Empirically, our 17-layer DPN significantly outperforms both CapsNets and STNs on affNIST (Sabour et al. 2017), for instance by 19.19% and 12.75%, respectively, with better generalization and better tolerance to affine transformations.