CVNov 4, 2023Code
MC-Stereo: Multi-peak Lookup and Cascade Search Range for Stereo MatchingMiaojie Feng, Junda Cheng, Hao Jia et al.
Stereo matching is a fundamental task in scene comprehension. In recent years, the method based on iterative optimization has shown promise in stereo matching. However, the current iteration framework employs a single-peak lookup, which struggles to handle the multi-peak problem effectively. Additionally, the fixed search range used during the iteration process limits the final convergence effects. To address these issues, we present a novel iterative optimization architecture called MC-Stereo. This architecture mitigates the multi-peak distribution problem in matching through the multi-peak lookup strategy, and integrates the coarse-to-fine concept into the iterative framework via the cascade search range. Furthermore, given that feature representation learning is crucial for successful learn-based stereo matching, we introduce a pre-trained network to serve as the feature extractor, enhancing the front end of the stereo matching pipeline. Based on these improvements, MC-Stereo ranks first among all publicly available methods on the KITTI-2012 and KITTI-2015 benchmarks, and also achieves state-of-the-art performance on ETH3D. Code is available at https://github.com/MiaoJieF/MC-Stereo.
LGMar 15, 2022
Supervised Contrastive Learning with Structure Inference for Graph ClassificationHao Jia, Junzhong Ji, Minglong Lei
Advanced graph neural networks have shown great potentials in graph classification tasks recently. Different from node classification where node embeddings aggregated from local neighbors can be directly used to learn node labels, graph classification requires a hierarchical accumulation of different levels of topological information to generate discriminative graph embeddings. Still, how to fully explore graph structures and formulate an effective graph classification pipeline remains rudimentary. In this paper, we propose a novel graph neural network based on supervised contrastive learning with structure inference for graph classification. First, we propose a data-driven graph augmentation strategy that can discover additional connections to enhance the existing edge set. Concretely, we resort to a structure inference stage based on diffusion cascades to recover possible connections with high node similarities. Second, to improve the contrastive power of graph neural networks, we propose to use a supervised contrastive loss for graph classification. With the integration of label information, the one-vs-many contrastive learning can be extended to a many-vs-many setting, so that the graph-level embeddings with higher topological similarities will be pulled closer. The supervised contrastive loss and structure inference can be naturally incorporated within the hierarchical graph neural networks where the topological patterns can be fully explored to produce discriminative graph embeddings. Experiment results show the effectiveness of the proposed method compared with recent state-of-the-art methods.
CVMar 1, 2024Code
Selective-Stereo: Adaptive Frequency Information Selection for Stereo MatchingXianqi Wang, Gangwei Xu, Hao Jia et al.
Stereo matching methods based on iterative optimization, like RAFT-Stereo and IGEV-Stereo, have evolved into a cornerstone in the field of stereo matching. However, these methods struggle to simultaneously capture high-frequency information in edges and low-frequency information in smooth regions due to the fixed receptive field. As a result, they tend to lose details, blur edges, and produce false matches in textureless areas. In this paper, we propose Selective Recurrent Unit (SRU), a novel iterative update operator for stereo matching. The SRU module can adaptively fuse hidden disparity information at multiple frequencies for edge and smooth regions. To perform adaptive fusion, we introduce a new Contextual Spatial Attention (CSA) module to generate attention maps as fusion weights. The SRU empowers the network to aggregate hidden disparity information across multiple frequencies, mitigating the risk of vital hidden disparity information loss during iterative processes. To verify SRU's universality, we apply it to representative iterative stereo matching methods, collectively referred to as Selective-Stereo. Our Selective-Stereo ranks $1^{st}$ on KITTI 2012, KITTI 2015, ETH3D, and Middlebury leaderboards among all published methods. Code is available at https://github.com/Windsrain/Selective-Stereo.
CLMay 10, 2022
AdMix: A Mixed Sample Data Augmentation Method for Neural Machine TranslationChang Jin, Shigui Qiu, Nini Xiao et al.
In Neural Machine Translation (NMT), data augmentation methods such as back-translation have proven their effectiveness in improving translation performance. In this paper, we propose a novel data augmentation approach for NMT, which is independent of any additional training data. Our approach, AdMix, consists of two parts: 1) introduce faint discrete noise (word replacement, word dropping, word swapping) into the original sentence pairs to form augmented samples; 2) generate new synthetic training data by softly mixing the augmented samples with their original samples in training corpus. Experiments on three translation datasets of different scales show that AdMix achieves signifi cant improvements (1.0 to 2.7 BLEU points) over strong Transformer baseline. When combined with other data augmentation techniques (e.g., back-translation), our approach can obtain further improvements.
CVDec 28, 2023Code
FlowDA: Unsupervised Domain Adaptive Framework for Optical Flow EstimationMiaojie Feng, Longliang Liu, Hao Jia et al.
Collecting real-world optical flow datasets is a formidable challenge due to the high cost of labeling. A shortage of datasets significantly constrains the real-world performance of optical flow models. Building virtual datasets that resemble real scenarios offers a potential solution for performance enhancement, yet a domain gap separates virtual and real datasets. This paper introduces FlowDA, an unsupervised domain adaptive (UDA) framework for optical flow estimation. FlowDA employs a UDA architecture based on mean-teacher and integrates concepts and techniques in unsupervised optical flow estimation. Furthermore, an Adaptive Curriculum Weighting (ACW) module based on curriculum learning is proposed to enhance the training effectiveness. Experimental outcomes demonstrate that our FlowDA outperforms state-of-the-art unsupervised optical flow estimation method SMURF by 21.6%, real optical flow dataset generation method MPI-Flow by 27.8%, and optical flow estimation adaptive method FlowSupervisor by 30.9%, offering novel insights for enhancing the performance of optical flow estimation in real-world scenarios. The code will be open-sourced after the publication of this paper.
59.4AIMay 9
PnP-Corrector: A Universal Correction Framework for Coupled Spatiotemporal ForecastingHao Wu, Fan Xu, Yuxu Lu et al.
Coupled spatiotemporal forecasting is important for predicting the future evolution of multiple interacting dynamical systems, such as in climate models. However, existing methods are severely constrained by the persistent bottleneck of compounding errors. In coupled systems, errors from each subsystem simulator propagate and amplify one another, a phenomenon we term Reciprocal Error Amplification, leading to a rapid collapse of long-range predictions. To address this challenge, we propose a universal framework called PnP-Corrector (Plug-and-Play Corrector). The core idea of our framework is to decouple the physical simulation from the error correction process: it freezes pre-trained physics simulation engines and exclusively trains a correction agent to proactively counteract the systematic biases emerging from the coupled system. Furthermore, we design an efficient predictive model architecture, DSLCast, to serve as the backbone of this framework. Extensive experiments demonstrate that our method significantly enhances the long-term stability and accuracy of coupled forecasting systems. For instance, in the challenging task of a 300-day global ocean-atmosphere coupled forecast, our PnP-Corrector framework reduces the prediction error of the baseline model by 29% and surpasses state-of-the-art models on several key metrics.
AIFeb 25
ProactiveMobile: A Comprehensive Benchmark for Boosting Proactive Intelligence on Mobile DevicesDezhi Kong, Zhengzhao Feng, Qiliang Liang et al.
Multimodal large language models (MLLMs) have made significant progress in mobile agent development, yet their capabilities are predominantly confined to a reactive paradigm, where they merely execute explicit user commands. The emerging paradigm of proactive intelligence, where agents autonomously anticipate needs and initiate actions, represents the next frontier for mobile agents. However, its development is critically bottlenecked by the lack of benchmarks that can address real-world complexity and enable objective, executable evaluation. To overcome these challenges, we introduce ProactiveMobile, a comprehensive benchmark designed to systematically advance research in this domain. ProactiveMobile formalizes the proactive task as inferring latent user intent across four dimensions of on-device contextual signals and generating an executable function sequence from a comprehensive function pool of 63 APIs. The benchmark features over 3,660 instances of 14 scenarios that embrace real-world complexity through multi-answer annotations. To ensure quality, a team of 30 experts conducts a final audit of the benchmark, verifying factual accuracy, logical consistency, and action feasibility, and correcting any non-compliant entries. Extensive experiments demonstrate that our fine-tuned Qwen2.5-VL-7B-Instruct achieves a success rate of 19.15%, outperforming o1 (15.71%) and GPT-5 (7.39%). This result indicates that proactivity is a critical competency widely lacking in current MLLMs, yet it is learnable, emphasizing the importance of the proposed benchmark for proactivity evaluation.
HEP-EXDec 5, 2023
CaloQVAE : Simulating high-energy particle-calorimeter interactions using hybrid quantum-classical generative modelsSehmimul Hoque, Hao Jia, Abhishek Abhishek et al.
The Large Hadron Collider's high luminosity era presents major computational challenges in the analysis of collision events. Large amounts of Monte Carlo (MC) simulation will be required to constrain the statistical uncertainties of the simulated datasets below these of the experimental data. Modelling of high-energy particles propagating through the calorimeter section of the detector is the most computationally intensive MC simulation task. We introduce a technique combining recent advancements in generative models and quantum annealing for fast and efficient simulation of high-energy particle-calorimeter interactions.
CVDec 6, 2023
Memory-Efficient Optical Flow via Radius-Distribution Orthogonal Cost VolumeGangwei Xu, Shujun Chen, Hao Jia et al.
The full 4D cost volume in Recurrent All-Pairs Field Transforms (RAFT) or global matching by Transformer achieves impressive performance for optical flow estimation. However, their memory consumption increases quadratically with input resolution, rendering them impractical for high-resolution images. In this paper, we present MeFlow, a novel memory-efficient method for high-resolution optical flow estimation. The key of MeFlow is a recurrent local orthogonal cost volume representation, which decomposes the 2D search space dynamically into two 1D orthogonal spaces, enabling our method to scale effectively to very high-resolution inputs. To preserve essential information in the orthogonal space, we utilize self attention to propagate feature information from the 2D space to the orthogonal space. We further propose a radius-distribution multi-scale lookup strategy to model the correspondences of large displacements at a negligible cost. We verify the efficiency and effectiveness of our method on the challenging Sintel and KITTI benchmarks, and real-world 4K ($2160\!\times\!3840$) images. Our method achieves competitive performance on both Sintel and KITTI benchmarks, while maintaining the highest memory efficiency on high-resolution inputs.
LGOct 30, 2024
Conditioned quantum-assisted deep generative surrogate for particle-calorimeter interactionsJ. Quetzalcoatl Toledo-Marin, Sebastian Gonzalez, Hao Jia et al.
Particle collisions at accelerators such as the Large Hadron Collider, recorded and analyzed by experiments such as ATLAS and CMS, enable exquisite measurements of the Standard Model and searches for new phenomena. Simulations of collision events at these detectors have played a pivotal role in shaping the design of future experiments and analyzing ongoing ones. However, the quest for accuracy in Large Hadron Collider (LHC) collisions comes at an imposing computational cost, with projections estimating the need for millions of CPU-years annually during the High Luminosity LHC (HL-LHC) run \cite{collaboration2022atlas}. Simulating a single LHC event with \textsc{Geant4} currently devours around 1000 CPU seconds, with simulations of the calorimeter subdetectors in particular imposing substantial computational demands \cite{rousseau2023experimental}. To address this challenge, we propose a conditioned quantum-assisted deep generative model. Our model integrates a conditioned variational autoencoder (VAE) on the exterior with a conditioned Restricted Boltzmann Machine (RBM) in the latent space, providing enhanced expressiveness compared to conventional VAEs. The RBM nodes and connections are meticulously engineered to enable the use of qubits and couplers on D-Wave's Pegasus-structured \textit{Advantage} quantum annealer (QA) for sampling. We introduce a novel method for conditioning the quantum-assisted RBM using \textit{flux biases}. We further propose a novel adaptive mapping to estimate the effective inverse temperature in quantum annealers. The effectiveness of our framework is illustrated using Dataset 2 of the CaloChallenge \cite{calochallenge}.
LGOct 28, 2025
Learning from History: A Retrieval-Augmented Framework for Spatiotemporal PredictionHao Jia, Penghao Zhao, Hao Wu et al.
Accurate and long-term spatiotemporal prediction for complex physical systems remains a fundamental challenge in scientific computing. While deep learning models, as powerful parametric approximators, have shown remarkable success, they suffer from a critical limitation: the accumulation of errors during long-term autoregressive rollouts often leads to physically implausible artifacts. This deficiency arises from their purely parametric nature, which struggles to capture the full constraints of a system's intrinsic dynamics. To address this, we introduce a novel \textbf{Retrieval-Augmented Prediction (RAP)} framework, a hybrid paradigm that synergizes the predictive power of deep networks with the grounded truth of historical data. The core philosophy of RAP is to leverage historical evolutionary exemplars as a non-parametric estimate of the system's local dynamics. For any given state, RAP efficiently retrieves the most similar historical analog from a large-scale database. The true future evolution of this analog then serves as a \textbf{reference target}. Critically, this target is not a hard constraint in the loss function but rather a powerful conditional input to a specialized dual-stream architecture. It provides strong \textbf{dynamic guidance}, steering the model's predictions towards physically viable trajectories. In extensive benchmarks across meteorology, turbulence, and fire simulation, RAP not only surpasses state-of-the-art methods but also significantly outperforms a strong \textbf{analog-only forecasting baseline}. More importantly, RAP generates predictions that are more physically realistic by effectively suppressing error divergence in long-term rollouts.
LGDec 6, 2024
Zephyr quantum-assisted hierarchical Calo4pQVAE for particle-calorimeter interactionsIan Lu, Hao Jia, Sebastian Gonzalez et al.
With the approach of the High Luminosity Large Hadron Collider (HL-LHC) era set to begin particle collisions by the end of this decade, it is evident that the computational demands of traditional collision simulation methods are becoming increasingly unsustainable. Existing approaches, which rely heavily on first-principles Monte Carlo simulations for modeling event showers in calorimeters, are projected to require millions of CPU-years annually -- far exceeding current computational capacities. This bottleneck presents an exciting opportunity for advancements in computational physics by integrating deep generative models with quantum simulations. We propose a quantum-assisted hierarchical deep generative surrogate founded on a variational autoencoder (VAE) in combination with an energy conditioned restricted Boltzmann machine (RBM) embedded in the model's latent space as a prior. By mapping the topology of D-Wave's Zephyr quantum annealer (QA) into the nodes and couplings of a 4-partite RBM, we leverage quantum simulation to accelerate our shower generation times significantly. To evaluate our framework, we use Dataset 2 of the CaloChallenge 2022. Through the integration of classical computation and quantum simulation, this hybrid framework paves way for utilizing large-scale quantum simulations as priors in deep generative models.
LGMay 26, 2025
Advanced Long-term Earth System ForecastingHao Wu, Yuan Gao, Ruijian Gou et al.
Reliable long-term forecasting of Earth system dynamics is fundamentally limited by instabilities in current artificial intelligence (AI) models during extended autoregressive simulations. These failures often originate from inherent spectral bias, leading to inadequate representation of critical high-frequency, small-scale processes and subsequent uncontrolled error amplification. Inspired by the nested grids in numerical models used to resolve small scales, we present TritonCast. At the core of its design is a dedicated latent dynamical core, which ensures the long-term stability of the macro-evolution at a coarse scale. An outer structure then fuses this stable trend with fine-grained local details. This design effectively mitigates the spectral bias caused by cross-scale interactions. In atmospheric science, it achieves state-of-the-art accuracy on the WeatherBench 2 benchmark while demonstrating exceptional long-term stability: executing year-long autoregressive global forecasts and completing multi-year climate simulations that span the entire available $2500$-day test period without drift. In oceanography, it extends skillful eddy forecast to $120$ days and exhibits unprecedented zero-shot cross-resolution generalization. Ablation studies reveal that this performance stems from the synergistic interplay of the architecture's core components. TritonCast thus offers a promising pathway towards a new generation of trustworthy, AI-driven simulations. This significant advance has the potential to accelerate discovery in climate and Earth system science, enabling more reliable long-term forecasting and deeper insights into complex geophysical dynamics.
HCJan 28, 2022
Towards Multi-class Pre-movement ClassificationHao Jia, Zhe Sun, Feng Duan et al.
In non-invasive brain-computer interface systems, pre-movement decoding plays an important role in the detection of movement before limbs actually move. Movement-related cortical potential is a kind of brain activity associated with pre-movement decoding. In current studies, patterns decoded from movement are mainly applied to the binary classification between movement state and resting state, such as elbow flexion and rest. The classifications between two movement states and among multiple movement states are still challenging. This study proposes a new method, the star-arrangement spectral filtering (SASF), to solve the multi-class pre-movement classification problem. We first design a referenced task-related component analysis (RTRCA) framework that consists of two modules. This first module is the classification between movement state and resting state; the second module is the classification of multiple movement states. SASF is developed by optimizing the features in RTRCA. In SASF, feature selection on filter banks is used on the first module of RTRCA, and feature selection on time windows is used on the second module of RTRCA. A linear discriminant analysis classifier is used to classify the optimized features. In the binary classification between two motions, the classification accuracy of SASF achieves 0.9670$\pm$0.0522, which is significantly higher than the result provided by the deep convolutional neural network (0.6247$\pm$0.0680) and the discriminative spatial pattern method (0.4400$\pm$0.0700). In the multi-class classification of 7 states, the classification accuracy of SASF is 0.9491$\pm$0.0372. The proposed SASF greatly improves the classification between two motions and enables the classification among multiple motions. The result shows that the movement can be decoded from EEG signals before the actual limb movement.
HCJan 28, 2022
Improving Pre-movement Pattern Detection with Filter Bank SelectionHao Jia, Zhe Sun, Feng Duan et al.
Pre-movement decoding plays an important role in movement detection and is able to detect movement onset with low-frequency electroencephalogram (EEG) signals before the limb moves. In related studies, pre-movement decoding with standard task-related component analysis (STRCA) has been demonstrated to be efficient for classification between movement state and resting state. However, the accuracies of STRCA differ among subbands in the frequency domain. Due to individual differences, the best subband differs among subjects and is difficult to be determined. This study aims to improve the performance of the STRCA method by a feature selection on multiple subbands and avoid the selection of best subbands. This study first compares three frequency range settings ($M_1$: subbands with equally spaced bandwidths; $M_2$: subbands whose high cut-off frequencies are twice the low cut-off frequencies; $M_3$: subbands that start at some specific fixed frequencies and end at the frequencies in an arithmetic sequence.). Then, we develop a mutual information based technique to select the features in these subbands. A binary support vector machine classifier is used to classify the selected essential features. The results show that $M_3$ is a better setting than the other two settings. With the filter banks in $M_3$, the classification accuracy of the proposed FBTRCA achieves 0.8700$\pm$0.1022, which means a significantly improved performance compared to STRCA (0.8287$\pm$0.1101) as well as to the cross validation and testing method (0.8431$\pm$0.1078).
IVMay 17, 2021
Real-Time Video Super-Resolution on Smartphones with Deep Learning, Mobile AI 2021 Challenge: ReportAndrey Ignatov, Andres Romero, Heewon Kim et al.
Video super-resolution has recently become one of the most important mobile-related problems due to the rise of video communication and streaming services. While many solutions have been proposed for this task, the majority of them are too computationally expensive to run on portable devices with limited hardware resources. To address this problem, we introduce the first Mobile AI challenge, where the target is to develop an end-to-end deep learning-based video super-resolution solutions that can achieve a real-time performance on mobile GPUs. The participants were provided with the REDS dataset and trained their models to do an efficient 4X video upscaling. The runtime of all models was evaluated on the OPPO Find X2 smartphone with the Snapdragon 865 SoC capable of accelerating floating-point networks on its Adreno GPU. The proposed solutions are fully compatible with any mobile GPU and can upscale videos to HD resolution at up to 80 FPS while demonstrating high fidelity results. A detailed description of all models developed in the challenge is provided in this paper.
IVMay 17, 2021
Real-Time Quantized Image Super-Resolution on Mobile NPUs, Mobile AI 2021 Challenge: ReportAndrey Ignatov, Radu Timofte, Maurizio Denna et al.
Image super-resolution is one of the most popular computer vision problems with many important applications to mobile devices. While many solutions have been proposed for this task, they are usually not optimized even for common smartphone AI hardware, not to mention more constrained smart TV platforms that are often supporting INT8 inference only. To address this problem, we introduce the first Mobile AI challenge, where the target is to develop an end-to-end deep learning-based image super-resolution solutions that can demonstrate a real-time performance on mobile or edge NPUs. For this, the participants were provided with the DIV2K dataset and trained quantized models to do an efficient 3X image upscaling. The runtime of all models was evaluated on the Synaptics VS680 Smart Home board with a dedicated NPU capable of accelerating quantized neural networks. The proposed solutions are fully compatible with all major mobile AI accelerators and are capable of reconstructing Full HD images under 40-60 ms while achieving high fidelity results. A detailed description of all models developed in the challenge is provided in this paper.
CLApr 15, 2021
Bilingual Terminology Extraction from Comparable E-Commerce CorporaHao Jia, Shuqin Gu, Yuqi Zhang et al.
Bilingual terminologies are important machine translation resources in the field of e-commerce, which are usually either manually translated or automatically extracted from parallel data. The human translation is costly and e-commerce parallel corpora is very scarce. However, the comparable data in different languages in the same commodity field is abundant. In this paper, we propose a novel framework of extracting e-commercial bilingual terminologies from comparable data. Benefiting from the cross-lingual pre-training in e-commerce, our framework can make full use of the deep semantic relationship between source-side terminology and target-side sentence to extract corresponding target terminology. Experimental results on various language pairs show that our approaches achieve significantly better performance than various strong baselines.
CLJul 6, 2020
Bilingual Dictionary Based Neural Machine Translation without Using Parallel SentencesXiangyu Duan, Baijun Ji, Hao Jia et al.
In this paper, we propose a new task of machine translation (MT), which is based on no parallel sentences but can refer to a ground-truth bilingual dictionary. Motivated by the ability of a monolingual speaker learning to translate via looking up the bilingual dictionary, we propose the task to see how much potential an MT system can attain using the bilingual dictionary and large scale monolingual corpora, while is independent on parallel sentences. We propose anchored training (AT) to tackle the task. AT uses the bilingual dictionary to establish anchoring points for closing the gap between source language and target language. Experiments on various language pairs show that our approaches are significantly better than various baselines, including dictionary-based word-by-word translation, dictionary-supervised cross-lingual word embedding transformation, and unsupervised MT. On distant language pairs that are hard for unsupervised MT to perform well, AT performs remarkably better, achieving performances comparable to supervised SMT trained on more than 4M parallel sentences.