CVOct 12, 2022
QDTrack: Quasi-Dense Similarity Learning for Appearance-Only Multiple Object TrackingTobias Fischer, Thomas E. Huang, Jiangmiao Pang et al. · eth-zurich, mit
Similarity learning has been recognized as a crucial step for object tracking. However, existing multiple object tracking methods only use sparse ground truth matching as the training objective, while ignoring the majority of the informative regions in images. In this paper, we present Quasi-Dense Similarity Learning, which densely samples hundreds of object regions on a pair of images for contrastive learning. We combine this similarity learning with multiple existing object detectors to build Quasi-Dense Tracking (QDTrack), which does not require displacement regression or motion priors. We find that the resulting distinctive feature space admits a simple nearest neighbor search at inference time for object association. In addition, we show that our similarity learning scheme is not limited to video data, but can learn effective instance similarity even from static input, enabling a competitive tracking performance without training on videos or using tracking supervision. We conduct extensive experiments on a wide variety of popular MOT benchmarks. We find that, despite its simplicity, QDTrack rivals the performance of state-of-the-art tracking methods on all benchmarks and sets a new state-of-the-art on the large-scale BDD100K MOT benchmark, while introducing negligible computational overhead to the detector.
LGJul 25, 2024
A Two-Stage Imaging Framework Combining CNN and Physics-Informed Neural Networks for Full-Inverse Tomography: A Case Study in Electrical Impedance Tomography (EIT)Xuanxuan Yang, Yangming Zhang, Haofeng Chen et al.
Electrical Impedance Tomography (EIT) is a highly ill-posed inverse problem, with the challenge of reconstructing internal conductivities using only boundary voltage measurements. Although Physics-Informed Neural Networks (PINNs) have shown potential in solving inverse problems, existing approaches are limited in their applicability to EIT, as they often rely on impractical prior knowledge and assumptions that cannot be satisfied in real-world scenarios. To address these limitations, we propose a two-stage hybrid learning framework that combines Convolutional Neural Networks (CNNs) and PINNs. This framework integrates data-driven and model-driven paradigms, blending supervised and unsupervised learning to reconstruct conductivity distributions while ensuring adherence to the underlying physical laws, thereby overcoming the constraints of existing methods.
LGDec 3, 2025
Physics-Driven Learning Framework for Tomographic Tactile SensingXuanxuan Yang, Xiuyang Zhang, Haofeng Chen et al.
Electrical impedance tomography (EIT) provides an attractive solution for large-area tactile sensing due to its minimal wiring and shape flexibility, but its nonlinear inverse problem often leads to severe artifacts and inaccurate contact reconstruction. This work presents PhyDNN, a physics-driven deep reconstruction framework that embeds the EIT forward model directly into the learning objective. By jointly minimizing the discrepancy between predicted and ground-truth conductivity maps and enforcing consistency with the forward PDE, PhyDNN reduces the black-box nature of deep networks and improves both physical plausibility and generalization. To enable efficient backpropagation, we design a differentiable forward-operator network that accurately approximates the nonlinear EIT response, allowing fast physics-guided training. Extensive simulations and real tactile experiments on a 16-electrode soft sensor show that PhyDNN consistently outperforms NOSER, TV, and standard DNNs in reconstructing contact shape, location, and pressure distribution. PhyDNN yields fewer artifacts, sharper boundaries, and higher metric scores, demonstrating its effectiveness for high-quality tomographic tactile sensing.
CVNov 23, 2025Code
SO-Bench: A Structural Output Evaluation of Multimodal LLMsDi Feng, Kaixin Ma, Feng Nan et al.
Multimodal large language models (MLLMs) are increasingly deployed in real-world, agentic settings where outputs must not only be correct, but also conform to predefined data schemas. Despite recent progress in structured generation in textual domain, there is still no benchmark that systematically evaluates schema-grounded information extraction and reasoning over visual inputs. In this work, we conduct a comprehensive study of visual structural output capabilities for MLLMs with our carefully designed SO-Bench benchmark. Covering four visual domains, including UI screens, natural images, documents, and charts, SO-Bench is built from over 6.5K diverse JSON schemas and 1.8K curated image-schema pairs with human-verified quality. Benchmarking experiments on open-sourced and frontier proprietary models reveal persistent gaps in predicting accurate, schema compliant outputs, highlighting the need for better multimodal structured reasoning. Beyond benchmarking, we further conduct training experiments to largely improve the model's structured output capability. We plan to make the benchmark available to the community.
CVMay 22, 2025
NTIRE 2025 challenge on Text to Image Generation Model Quality AssessmentShuhao Han, Haotian Fan, Fangyuan Kong et al.
This paper reports on the NTIRE 2025 challenge on Text to Image (T2I) generation model quality assessment, which will be held in conjunction with the New Trends in Image Restoration and Enhancement Workshop (NTIRE) at CVPR 2025. The aim of this challenge is to address the fine-grained quality assessment of text-to-image generation models. This challenge evaluates text-to-image models from two aspects: image-text alignment and image structural distortion detection, and is divided into the alignment track and the structural track. The alignment track uses the EvalMuse-40K, which contains around 40K AI-Generated Images (AIGIs) generated by 20 popular generative models. The alignment track has a total of 371 registered participants. A total of 1,883 submissions are received in the development phase, and 507 submissions are received in the test phase. Finally, 12 participating teams submitted their models and fact sheets. The structure track uses the EvalMuse-Structure, which contains 10,000 AI-Generated Images (AIGIs) with corresponding structural distortion mask. A total of 211 participants have registered in the structure track. A total of 1155 submissions are received in the development phase, and 487 submissions are received in the test phase. Finally, 8 participating teams submitted their models and fact sheets. Almost all methods have achieved better results than baseline methods, and the winning methods in both tracks have demonstrated superior prediction performance on T2I model quality assessment.
CVMay 11, 2021
Home Action Genome: Cooperative Compositional Action UnderstandingNishant Rai, Haofeng Chen, Jingwei Ji et al.
Existing research on action recognition treats activities as monolithic events occurring in videos. Recently, the benefits of formulating actions as a combination of atomic-actions have shown promise in improving action understanding with the emergence of datasets containing such annotations, allowing us to learn representations capturing this information. However, there remains a lack of studies that extend action composition and leverage multiple viewpoints and multiple modalities of data for representation learning. To promote research in this direction, we introduce Home Action Genome (HOMAGE): a multi-view action dataset with multiple modalities and view-points supplemented with hierarchical activity and atomic action labels together with dense scene composition labels. Leveraging rich multi-modal and multi-view settings, we propose Cooperative Compositional Action Understanding (CCAU), a cooperative learning framework for hierarchical action recognition that is aware of compositional action elements. CCAU shows consistent performance improvements across all modalities. Furthermore, we demonstrate the utility of co-learning compositions in few-shot action recognition by achieving 28.6% mAP with just a single sample.
CVJun 11, 2020
Quasi-Dense Similarity Learning for Multiple Object TrackingJiangmiao Pang, Linlu Qiu, Xia Li et al.
Similarity learning has been recognized as a crucial step for object tracking. However, existing multiple object tracking methods only use sparse ground truth matching as the training objective, while ignoring the majority of the informative regions on the images. In this paper, we present Quasi-Dense Similarity Learning, which densely samples hundreds of region proposals on a pair of images for contrastive learning. We can directly combine this similarity learning with existing detection methods to build Quasi-Dense Tracking (QDTrack) without turning to displacement regression or motion priors. We also find that the resulting distinctive feature space admits a simple nearest neighbor search at the inference time. Despite its simplicity, QDTrack outperforms all existing methods on MOT, BDD100K, Waymo, and TAO tracking benchmarks. It achieves 68.7 MOTA at 20.3 FPS on MOT17 without using external training data. Compared to methods with similar detectors, it boosts almost 10 points of MOTA and significantly decreases the number of ID switches on BDD100K and Waymo datasets. Our code and trained models are available at http://vis.xyz/pub/qdtrack.
CVMay 12, 2018
BDD100K: A Diverse Driving Dataset for Heterogeneous Multitask LearningFisher Yu, Haofeng Chen, Xin Wang et al.
Datasets drive vision progress, yet existing driving datasets are impoverished in terms of visual content and supported tasks to study multitask learning for autonomous driving. Researchers are usually constrained to study a small set of problems on one dataset, while real-world computer vision applications require performing tasks of various complexities. We construct BDD100K, the largest driving video dataset with 100K videos and 10 tasks to evaluate the exciting progress of image recognition algorithms on autonomous driving. The dataset possesses geographic, environmental, and weather diversity, which is useful for training models that are less likely to be surprised by new conditions. Based on this diverse dataset, we build a benchmark for heterogeneous multitask learning and study how to solve the tasks together. Our experiments show that special training strategies are needed for existing models to perform such heterogeneous tasks. BDD100K opens the door for future studies in this important venue.
RONov 4, 2017
Analyzing Material Recognition Performance of Thermal Tactile Sensing using a Large Materials Database and a Real RobotHaoping Bai, Haofeng Chen, Elizabeth Healy et al.
In this paper we focus on analyzing the thermal modality of tactile sensing for material recognition using a large materials database. Many factors affect thermal recognition performance, including sensor noise, the initial temperatures of the sensor and the object, the thermal effusivities of the materials, and the duration of contact. To analyze the influence of these factors on thermal recognition, we used a semi-infinite solid based thermal model to simulate heat-transfer data from all the materials in the CES Edupack Level-1 database. We used support-vector machines (SVMs) to predict F1 scores for binary material recognition for 2346 material pairs. We also collected data using a real robot equipped with a thermal sensor and analyzed its material recognition performance on 66 real-world material pairs. Additionally, we analyzed the performance when the models were trained on the simulated data and tested on the real-robot data. Our models predicted the material recognition performance with a 0.980 F1 score for the simulated data, a 0.994 F1 score for real-world data with constant initial sensor temperatures, a 0.966 F1 score for real-world data with varied initial sensor temperatures, and a 0.815 F1 score for sim-to-real transfer. Finally, we present some guidelines on sensor design and parameter choice for thermal recognition based on the insights gained from these results that would hopefully enable robotics researchers to use this less-explored tactile sensing modality more effectively during physical human-robot and robot-object interactions. We release our simulated and real-robot datasets for further use by the robotics community.