CVJun 2
Eliciting Complex Spatial Reasoning in MLLMs through Wide-Baseline MatchingHao Zhong, Muzhi Zhu, Shenyan Zeng et al.
Wide-baseline matching (WBM) requires integrating geometric understanding, viewpoint changes, fine-grained perception, and occlusion reasoning, making it a challenging testbed for spatial reasoning in multimodal large language models (MLLMs) deployed in physical environments. However, current MLLMs lack systematic evaluation and training frameworks for these capabilities. We introduce ReasonMatch-Bench, a benchmark stratified by viewpoint displacement and matching granularity across indoor, outdoor, and object-centric scenarios, and show that current MLLMs still struggle with fine-grained wide-baseline correspondence: on a difficult 90-sample subset, human annotators achieve 84.0 F1, while the best existing baseline reaches 37.2. To bridge this gap, we build a scalable data-generation pipeline that automatically extracts wide-baseline view pairs from large-scale video-3D corpora, including RGB-D videos and SfM reconstructions, yielding diverse and verifiable supervision. We further propose Dynamic Correspondence Reinforcement Learning (DCRL), which combines Image-Level Viewpoint Progression and Point-Level Correspondence Curriculum to improve WBM training through verifiable rewards without explicit CoT supervision. Extensive experiments show that DCRL substantially improves ReasonMatch-Bench and transfers to related spatial benchmarks, while maintaining general visual understanding performance with modest gains on several benchmarks.
CLJun 3
Streaming Communication in Multi-Agent ReasoningZhen Yang, Xiaogang Xu, Wen Wang et al.
Multi-agent reasoning systems adopt a "generate-then-transfer" paradigm that forces end-to-end latency to scale linearly with pipeline depth. We introduce StreamMA, a multi-agent reasoning system that streams each reasoning step to downstream agents as soon as it is generated, pipelining adjacent agents and thus reducing latency. Surprisingly, this pipelining also improves effectiveness: because multi-step reasoning quality is non-uniform and early steps are more reliable than later ones, working with these reliable early steps instead of the full chain prevents error-prone late steps from misleading downstream agents. We formalize both advantages with the first closed-form joint analysis of stream, serial, and single protocols, deriving the effectiveness ordering, speedup upper bound, and cost ratio. Across eight reasoning benchmarks spanning mathematics, science, and code, two frontier LLMs (Claude Opus 4.6 and GPT-5.4), and three topologies (Chain, Tree, Graph), StreamMA outperforms both baselines (avg. +7.3 pp, max +22.4 pp on HMMT 2026; Claude Opus 4.6-high). Beyond these contributions, we discover a "step-level scaling law": increasing per-agent steps consistently improves both effectiveness and efficiency, a new scaling dimension orthogonal to and composable with agent-count scaling.
CVMar 29, 2022Code
Coarse to Fine: Image Restoration Boosted by Multi-Scale Low-Rank Tensor CompletionRui Lin, Cong Chen, Ngai Wong
Existing low-rank tensor completion (LRTC) approaches aim at restoring a partially observed tensor by imposing a global low-rank constraint on the underlying completed tensor. However, such a global rank assumption suffers the trade-off between restoring the originally details-lacking parts and neglecting the potentially complex objects, making the completion performance unsatisfactory on both sides. To address this problem, we propose a novel and practical strategy for image restoration that restores the partially observed tensor in a coarse-to-fine (C2F) manner, which gets rid of such trade-off by searching proper local ranks for both low- and high-rank parts. Extensive experiments are conducted to demonstrate the superiority of the proposed C2F scheme. The codes are available at: https://github.com/RuiLin0212/C2FLRTC.
AIJan 23Code
LongCat-Flash-Thinking-2601 Technical ReportMeituan LongCat Team, Anchun Gui, Bei Li et al.
We introduce LongCat-Flash-Thinking-2601, a 560-billion-parameter open-source Mixture-of-Experts (MoE) reasoning model with superior agentic reasoning capability. LongCat-Flash-Thinking-2601 achieves state-of-the-art performance among open-source models on a wide range of agentic benchmarks, including agentic search, agentic tool use, and tool-integrated reasoning. Beyond benchmark performance, the model demonstrates strong generalization to complex tool interactions and robust behavior under noisy real-world environments. Its advanced capability stems from a unified training framework that combines domain-parallel expert training with subsequent fusion, together with an end-to-end co-design of data construction, environments, algorithms, and infrastructure spanning from pre-training to post-training. In particular, the model's strong generalization capability in complex tool-use are driven by our in-depth exploration of environment scaling and principled task construction. To optimize long-tailed, skewed generation and multi-turn agentic interactions, and to enable stable training across over 10,000 environments spanning more than 20 domains, we systematically extend our asynchronous reinforcement learning framework, DORA, for stable and efficient large-scale multi-environment training. Furthermore, recognizing that real-world tasks are inherently noisy, we conduct a systematic analysis and decomposition of real-world noise patterns, and design targeted training procedures to explicitly incorporate such imperfections into the training process, resulting in improved robustness for real-world applications. To further enhance performance on complex reasoning tasks, we introduce a Heavy Thinking mode that enables effective test-time scaling by jointly expanding reasoning depth and width through intensive parallel thinking.
IVJun 18, 2023
The STOIC2021 COVID-19 AI challenge: applying reusable training methodologies to private dataLuuk H. Boulogne, Julian Lorenz, Daniel Kienzle et al.
Challenges drive the state-of-the-art of automated medical image analysis. The quantity of public training data that they provide can limit the performance of their solutions. Public access to the training methodology for these solutions remains absent. This study implements the Type Three (T3) challenge format, which allows for training solutions on private data and guarantees reusable training methodologies. With T3, challenge organizers train a codebase provided by the participants on sequestered training data. T3 was implemented in the STOIC2021 challenge, with the goal of predicting from a computed tomography (CT) scan whether subjects had a severe COVID-19 infection, defined as intubation or death within one month. STOIC2021 consisted of a Qualification phase, where participants developed challenge solutions using 2000 publicly available CT scans, and a Final phase, where participants submitted their training methodologies with which solutions were trained on CT scans of 9724 subjects. The organizers successfully trained six of the eight Final phase submissions. The submitted codebases for training and running inference were released publicly. The winning solution obtained an area under the receiver operating characteristic curve for discerning between severe and non-severe COVID-19 of 0.815. The Final phase solutions of all finalists improved upon their Qualification phase solutions.HSUXJM-TNZF9CHSUXJM-TNZF9C
CVApr 25, 2023
LEMaRT: Label-Efficient Masked Region Transform for Image HarmonizationSheng Liu, Cong Phuoc Huynh, Cong Chen et al.
We present a simple yet effective self-supervised pre-training method for image harmonization which can leverage large-scale unannotated image datasets. To achieve this goal, we first generate pre-training data online with our Label-Efficient Masked Region Transform (LEMaRT) pipeline. Given an image, LEMaRT generates a foreground mask and then applies a set of transformations to perturb various visual attributes, e.g., defocus blur, contrast, saturation, of the region specified by the generated mask. We then pre-train image harmonization models by recovering the original image from the perturbed image. Secondly, we introduce an image harmonization model, namely SwinIH, by retrofitting the Swin Transformer [27] with a combination of local and global self-attention mechanisms. Pre-training SwinIH with LEMaRT results in a new state of the art for image harmonization, while being label-efficient, i.e., consuming less annotated data for fine-tuning than existing methods. Notably, on iHarmony4 dataset [8], SwinIH outperforms the state of the art, i.e., SCS-Co [16] by a margin of 0.4 dB when it is fine-tuned on only 50% of the training data, and by 1.0 dB when it is trained on the full training dataset.
SYApr 16
Mean-Field Learning for Storage AggregationJingguan Liu, Cong Chen, Xiaomeng Ai et al.
Distributed energy storage devices can be aggregated to provide operational flexibility for power systems. This requires representing a massive device population as a single, tractable surrogate that is computationally efficient and accurate. However, surrogate identification is challenging due to heterogeneity, nonconvexity, and high dimensionality of storage devices. To address these challenges, this paper develops a mean-field learning framework for storage aggregation. We interpret aggregation as the average behavior of a large storage population and show that, as the population grows, aggregate performance converges to a unique, convex mean-field limit, enabling tractable population-level modeling. This convexity further yields a price-responsive characterization of aggregate storage behavior and allows us to bound the mean-field approximation error. We construct a convex surrogate model with physically interpretable parameters that approximates the aggregate behavior of large storage populations and can be embedded directly into power system operations. Surrogate parameter identification is formulated as an optimization problem using historical price-response data, and we adopt a gradient-based algorithm for efficient learning. Case studies validate the theoretical findings and demonstrate the effectiveness of the proposed framework in approximation accuracy and data efficiency.
OCMay 25
A Scalable Bundle Method for Exact Reformulation of SDP in Three-Phase Power Flow FeasibilityBohang Fang, Lijun Ding, Cong Chen
Power flow feasibility assessment is computationally challenging for unbalanced three-phase distribution networks. This paper develops a vectorized semidefinite program (SDP) based on the bus injection model (BIM) and reformulates its dual as an exact-penalty problem, enabling us to develop a scalable three-cut proximal bundle method for feasibility assessment. The proposed bundle method is numerically over 400 times faster than MOSEK with less than 1/2000 of its memory; on the decomposed BIM-SDP, approximately 2 times faster with 75% less memory.
SYApr 14
Wholesale Market Participation via Competitive DER AggregationCong Chen, Ahmed S. Alahmed, Timothy D. Mount et al.
We consider the aggregation of distributed energy resources (DERs), such as solar PV, energy storage, and flexible loads, by a profit-seeking aggregator participating directly in the wholesale market under distribution network access constraints. We propose a competitive DER aggregator (DERA) model that directly controls local DERs to maximize its profits, while ensuring each aggregated customer gains a surplus higher than their surplus under the regulated retail tariff. The DERA participates in the wholesale electricity market as virtual storage with optimized generation offers and consumption bids derived from the propoed competitive aggregation model. Also derived are DERA's bid curves for the distribution network access and DERA's profitability when competing with the regulated retail tariff. We show that, with the same distribution network access, the proposed DERA's wholesale market participation achieves the same welfare-maximizing outcome as when its customers participate directly in the wholesale market. Extensive numerical studies compare the proposed DERA with existing methods in terms of customer surplus and DERA profit. We empirically evaluate how many DERAs can survive in the competition at long-run equilibrium, and assess the impacts of DER adoption levels and distribution network access on short-run operations.
LGMar 7, 2025Code
Every FLOP Counts: Scaling a 300B Mixture-of-Experts LING LLM without Premium GPUsLing Team, Binwei Zeng, Chao Huang et al.
In this technical report, we tackle the challenges of training large-scale Mixture of Experts (MoE) models, focusing on overcoming cost inefficiency and resource limitations prevalent in such systems. To address these issues, we present two differently sized MoE large language models (LLMs), namely Ling-Lite and Ling-Plus (referred to as "Bailing" in Chinese, spelled Bǎilíng in Pinyin). Ling-Lite contains 16.8 billion parameters with 2.75 billion activated parameters, while Ling-Plus boasts 290 billion parameters with 28.8 billion activated parameters. Both models exhibit comparable performance to leading industry benchmarks. This report offers actionable insights to improve the efficiency and accessibility of AI development in resource-constrained settings, promoting more scalable and sustainable technologies. Specifically, to reduce training costs for large-scale MoE models, we propose innovative methods for (1) optimization of model architecture and training processes, (2) refinement of training anomaly handling, and (3) enhancement of model evaluation efficiency. Additionally, leveraging high-quality data generated from knowledge graphs, our models demonstrate superior capabilities in tool use compared to other models. Ultimately, our experimental findings demonstrate that a 300B MoE LLM can be effectively trained on lower-performance devices while achieving comparable performance to models of a similar scale, including dense and MoE models. Compared to high-performance devices, utilizing a lower-specification hardware system during the pre-training phase demonstrates significant cost savings, reducing computing costs by approximately 20%. The models can be accessed at https://huggingface.co/inclusionAI.
CLSep 1, 2025Code
LongCat-Flash Technical ReportMeituan LongCat Team, Bayan, Bei Li et al.
We introduce LongCat-Flash, a 560-billion-parameter Mixture-of-Experts (MoE) language model designed for both computational efficiency and advanced agentic capabilities. Stemming from the need for scalable efficiency, LongCat-Flash adopts two novel designs: (a) Zero-computation Experts, which enables dynamic computational budget allocation and activates 18.6B-31.3B (27B on average) per token depending on contextual demands, optimizing resource usage. (b) Shortcut-connected MoE, which enlarges the computation-communication overlap window, demonstrating notable gains in inference efficiency and throughput compared to models of a comparable scale. We develop a comprehensive scaling framework for large models that combines hyperparameter transfer, model-growth initialization, a multi-pronged stability suite, and deterministic computation to achieve stable and reproducible training. Notably, leveraging the synergy among scalable architectural design and infrastructure efforts, we complete model training on more than 20 trillion tokens within 30 days, while achieving over 100 tokens per second (TPS) for inference at a cost of \$0.70 per million output tokens. To cultivate LongCat-Flash towards agentic intelligence, we conduct a large-scale pre-training on optimized mixtures, followed by targeted mid- and post-training on reasoning, code, and instructions, with further augmentation from synthetic data and tool use tasks. Comprehensive evaluations demonstrate that, as a non-thinking foundation model, LongCat-Flash delivers highly competitive performance among other leading models, with exceptional strengths in agentic tasks. The model checkpoint of LongCat-Flash is open-sourced to foster community research. LongCat Chat: https://longcat.ai Hugging Face: https://huggingface.co/meituan-longcat GitHub: https://github.com/meituan-longcat
MAApr 1
Competition and Cooperation of LLM Agents in GamesJiayi Yao, Cong Chen, Baosen Zhang
Large language model (LLM) agents are increasingly deployed in competitive multi-agent settings, raising fundamental questions about whether they converge to equilibria and how their strategic behavior can be characterized. In this paper, we study LLM agent interactions in two standard games: a network resource allocation game and a Cournot competition game. Rather than converging to Nash equilibria, we find that LLM agents tend to cooperate when given multi-round prompts and non-zero-sum context. Chain-of-thought analysis reveals that fairness reasoning is central to this behavior. We propose an analytical framework that captures the dynamics of LLM agent reasoning across rounds and explains these experimental findings.
CRMay 14
The Great Pretender: A Stochasticity Problem in LLM JailbreakJean-Philippe Monteuuis, Cong Chen, Jonathan Petit
"Oh-Oh, yes, I'm the great pretender. Pretending that I'm doing well. My need is such, I pretend too much..." summarizes the state in the area of jailbreak creation and evaluation. You find this method to generate adversarial attacks proposed by a reputable institution (e.g., BoN from Anthropic or Crescendo from Microsoft Research). However, this method does not deliver on the promise claimed in the paper despite having top ASR scores against industry-grade LLMs. You successfully generate the jailbreak prompts against your target (open) model. However, the generated jailbreak prompt works against the target model with a 50% consecutive success rate (5 out of 10 attempts) despite having an 80% ASR (on paper) on the latest closed-source model (with a guardrail system)! This observation leads us to think. First, Attack Success Rate (ASR), the primary metric for LLM jailbreak benchmarking, is not a stable quantity. Second, published ASR numbers are therefore systematically inflated and incomparable across papers. Therefore, we wonder "Why a successful jailbreak prompt does not perform consistently well against a target model on which the prompts have been optimized?". To answer this question, we study the impact of stochasticity not only during attack evaluation but also during attack generation. Our evaluation includes several jailbreak attacks, models (different sizes and providers), and judges. In addition, we propose a new metric and two new frameworks (CAS-eval and CAS-gen). Our evaluation framework, CAS-eval, shows that an attack can have an ASR drop of up to 30 percentage points when a jailbreak prompt needs to succeed on more than one attempt. Thankfully, our attack generation framework (CAS-gen) improves previous jailbreak methods and helps them recover this loss of 30 percentage points!
SYMay 13
Reachable-Set Decomposition for Real-Time Aggregation of Multi-Zone HVAC FleetsJingguan Liu, Xiaomeng Ai, Cong Chen et al.
Aggregating building heating, ventilation, and air-conditioning (HVAC) fleets provides substantial real-time flexibility to power system operations. However, real-time aggregation of multi-zone HVAC fleets faces two key challenges: (i) strong coupling across zones and time makes flexibility characterization high-dimensional and computationally demanding, and (ii) the sequential revelation of temperature states and exogenous conditions requires that decisions made at each period preserve feasibility over the remaining horizon using only currently realized information. To address these challenges, this paper proposes a reachable-set decomposition framework comprising an offline decomposition stage and a real-time policy. In the offline stage, backward reachable sets are formulated to encode remaining-horizon feasibility into per-period state constraints, so that any state within the current reachable set is guaranteed to sustain feasible operation over the entire remaining horizon. A tailored inner approximation is then developed for tractable calculation in multi-zone-coupled HVAC settings. In the real-time stage, aggregate flexibility is computed efficiently via building-level parallel linear programs followed by closed-form Minkowski summation of power intervals, and any regulation signal within the reported flexibility interval admits a recursively feasible disaggregation. Case studies demonstrate the effectiveness of the proposed framework in aggregate flexibility characterization, disaggregation feasibility, and scalable computation.
AISep 23, 2025Code
Introducing LongCat-Flash-Thinking: A Technical ReportMeituan LongCat Team, Anchun Gui, Bei Li et al.
We present LongCat-Flash-Thinking, an efficient 560-billion-parameter open-source Mixture-of-Experts (MoE) reasoning model. Its advanced capabilities are cultivated through a meticulously crafted training process, beginning with long Chain-of-Thought (CoT) data cold-start and culminating in large-scale Reinforcement Learning (RL). We first employ a well-designed cold-start training strategy, which significantly enhances the reasoning potential and equips the model with specialized skills in both formal and agentic reasoning. Then, a core innovation is our domain-parallel training scheme, which decouples optimization across distinct domains (e.g., STEM, Code, Agentic) and subsequently fuses the resulting expert models into a single, nearly Pareto-optimal model. This entire process is powered by our Dynamic ORchestration for Asynchronous rollout (DORA) system, a large-scale RL framework that delivers a greater than threefold training speedup over synchronous methods on tens of thousands of accelerators. As a result, LongCat-Flash-Thinking achieves state-of-the-art performance among open-source models on a suite of complex reasoning tasks. The model exhibits exceptional efficiency in agentic reasoning, reducing average token consumption by 64.5% (from 19, 653 to 6, 965) on AIME-25, without degrading task accuracy. We release LongCat-Flash-Thinking to promote further advances in reasoning systems and agentic AI research.
SDJun 22, 2024Code
The Music Maestro or The Musically Challenged, A Massive Music Evaluation Benchmark for Large Language ModelsJiajia Li, Lu Yang, Mingni Tang et al.
Benchmark plays a pivotal role in assessing the advancements of large language models (LLMs). While numerous benchmarks have been proposed to evaluate LLMs' capabilities, there is a notable absence of a dedicated benchmark for assessing their musical abilities. To address this gap, we present ZIQI-Eval, a comprehensive and large-scale music benchmark specifically designed to evaluate the music-related capabilities of LLMs. ZIQI-Eval encompasses a wide range of questions, covering 10 major categories and 56 subcategories, resulting in over 14,000 meticulously curated data entries. By leveraging ZIQI-Eval, we conduct a comprehensive evaluation over 16 LLMs to evaluate and analyze LLMs' performance in the domain of music. Results indicate that all LLMs perform poorly on the ZIQI-Eval benchmark, suggesting significant room for improvement in their musical capabilities. With ZIQI-Eval, we aim to provide a standardized and robust evaluation framework that facilitates a comprehensive assessment of LLMs' music-related abilities. The dataset is available at GitHub\footnote{https://github.com/zcli-charlie/ZIQI-Eval} and HuggingFace\footnote{https://huggingface.co/datasets/MYTH-Lab/ZIQI-Eval}.
CVOct 21, 2021Code
A Deep Insight into Measuring Face Image Utility with General and Face-specific Image Quality MetricsBiying Fu, Cong Chen, Olaf Henniger et al.
Quality scores provide a measure to evaluate the utility of biometric samples for biometric recognition. Biometric recognition systems require high-quality samples to achieve optimal performance. This paper focuses on face images and the measurement of face image utility with general and face-specific image quality metrics. While face-specific metrics rely on features of aligned face images, general image quality metrics can be used on the global image and relate to human perceptions. In this paper, we analyze the gap between the general image quality metrics and the face image quality metrics. Our contribution lies in a thorough examination of how different the image quality assessment algorithms relate to the utility for the face recognition task. The results of image quality assessment algorithms are further compared with those of dedicated face image quality assessment algorithms. In total, 25 different quality metrics are evaluated on three face image databases, BioSecure, LFW, and VGGFace2 using three open-source face recognition solutions, SphereFace, ArcFace, and FaceNet. Our results reveal a clear correlation between learned image metrics to face image utility even without being specifically trained as a face utility measure. Individual handcrafted features lack general stability and perform significantly worse than general face-specific quality metrics. We additionally provide a visual insight into the image areas contributing to the quality score of a selected set of quality assessment methods.
CVJul 13, 2020Code
Temporal Self-Ensembling Teacher for Semi-Supervised Object DetectionCong Chen, Shouyang Dong, Ye Tian et al.
This paper focuses on Semi-Supervised Object Detection (SSOD). Knowledge Distillation (KD) has been widely used for semi-supervised image classification. However, adapting these methods for SSOD has the following obstacles. (1) The teacher model serves a dual role as a teacher and a student, such that the teacher predictions on unlabeled images may be very close to those of student, which limits the upper-bound of the student. (2) The class imbalance issue in SSOD hinders an efficient knowledge transfer from teacher to student. To address these problems, we propose a novel method Temporal Self-Ensembling Teacher (TSE-T) for SSOD. Differently from previous KD based methods, we devise a temporally evolved teacher model. First, our teacher model ensembles its temporal predictions for unlabeled images under stochastic perturbations. Second, our teacher model ensembles its temporal model weights with the student model weights by an exponential moving average (EMA) which allows the teacher gradually learn from the student. These self-ensembling strategies increase data and model diversity, thus improving teacher predictions on unlabeled images. Finally, we use focal loss to formulate consistency regularization term to handle the data imbalance problem, which is a more efficient manner to utilize the useful information from unlabeled images than a simple hard-thresholding method which solely preserves confident predictions. Evaluated on the widely used VOC and COCO benchmarks, the mAP of our method has achieved 80.73% and 40.52% on the VOC2007 test set and the COCO2014 minval5k set respectively, which outperforms a strong fully-supervised detector by 2.37% and 1.49%. Furthermore, our method sets the new state-of-the-art in SSOD on VOC2007 test set which outperforms the baseline SSOD method by 1.44%. The source code of this work is publicly available at http://github.com/syangdong/tse-t.
SYApr 10
Risk-Aware Allocation of Transmission Capacity for AI Data CentersShaoze Li, Bohang Fang, Cong Chen
Rapid growth in AI-driven data center loads is creating significant challenges for transmission grid interconnection. This paper proposes robust and risk-aware frameworks to quantify transmission capacity as firm and flexible capacities. We efficiently solve the robust optimization problem to determine firm capacity when minimizing unserved data center demand. Building upon this, we introduce a risk-aware allocation for flexible capacity, showing that tolerating a minimal probability of service interruption and blackout can unlock substantial flexible capacity of transmission networks and accelerate data center interconnection. To efficiently allocate scarce transmission capacities among competing data centers, we adopt the simultaneous ascending auction, characterizing products by capacity, risk level, and location. Under additive or symmetric concave valuation functions, the auction converges to a competitive equilibrium and achieves efficient allocation.
CVMar 9, 2025
PerturboLLaVA: Reducing Multimodal Hallucinations with Perturbative Visual TrainingCong Chen, Mingyu Liu, Chenchen Jing et al.
This paper aims to address the challenge of hallucinations in Multimodal Large Language Models (MLLMs) particularly for dense image captioning tasks. To tackle the challenge, we identify the current lack of a metric that finely measures the caption quality in concept level. We hereby introduce HalFscore, a novel metric built upon the language graph and is designed to evaluate both the accuracy and completeness of dense captions at a granular level. Additionally, we identify the root cause of hallucination as the model's over-reliance on its language prior. To address this, we propose PerturboLLaVA, which reduces the model's reliance on the language prior by incorporating adversarially perturbed text during training. This method enhances the model's focus on visual inputs, effectively reducing hallucinations and producing accurate, image-grounded descriptions without incurring additional computational overhead. PerturboLLaVA significantly improves the fidelity of generated captions, outperforming existing approaches in handling multimodal hallucinations and achieving improved performance across general multimodal benchmarks.
LGApr 8
Android Coach: Improve Online Agentic Training Efficiency with Single State Multiple ActionsGuo Gan, Yuxuan Ding, Cong Chen et al.
Online reinforcement learning (RL) serves as an effective method for enhancing the capabilities of Android agents. However, guiding agents to learn through online interaction is prohibitively expensive due to the high latency of emulators and the sample inefficiency of existing RL algorithms. We identify a fundamental limitation in current approaches: the Single State Single Action paradigm, which updates the policy with one-to-one state-action pairs from online one-way rollouts without fully exploring each costly emulator state. In this paper, we propose Android Coach, a novel framework that shifts the training paradigm to Single State Multiple Actions, allowing the agent to sample and utilize multiple actions for a single online state. We enable this without additional emulator overhead by learning a critic that estimates action values. To ensure the critic serves as a reliable coach, we integrate a process reward model and introduce a group-wise advantage estimator based on the averaged critic outputs. Extensive experiments demonstrate the effectiveness and efficiency of Android Coach: it achieves 7.5% and 8.3% success rate improvements on AndroidLab and AndroidWorld over UI-TARS-1.5-7B, and attains 1.4x higher training efficiency than Single State Single Action methods PPO and GRPO at matched success rates.
AIJan 14
AviationLMM: A Large Multimodal Foundation Model for Civil AviationWenbin Li, Jingling Wu, Xiaoyong Lin. Jing Chen et al.
Civil aviation is a cornerstone of global transportation and commerce, and ensuring its safety, efficiency and customer satisfaction is paramount. Yet conventional Artificial Intelligence (AI) solutions in aviation remain siloed and narrow, focusing on isolated tasks or single modalities. They struggle to integrate heterogeneous data such as voice communications, radar tracks, sensor streams and textual reports, which limits situational awareness, adaptability, and real-time decision support. This paper introduces the vision of AviationLMM, a Large Multimodal foundation Model for civil aviation, designed to unify the heterogeneous data streams of civil aviation and enable understanding, reasoning, generation and agentic applications. We firstly identify the gaps between existing AI solutions and requirements. Secondly, we describe the model architecture that ingests multimodal inputs such as air-ground voice, surveillance, on-board telemetry, video and structured texts, and performs cross-modal alignment and fusion, and produces flexible outputs ranging from situation summaries and risk alerts to predictive diagnostics and multimodal incident reconstructions. In order to fully realize this vision, we identify key research opportunities to address, including data acquisition, alignment and fusion, pretraining, reasoning, trustworthiness, privacy, robustness to missing modalities, and synthetic scenario generation. By articulating the design and challenges of AviationLMM, we aim to boost the civil aviation foundation model progress and catalyze coordinated research efforts toward an integrated, trustworthy and privacy-preserving aviation AI ecosystem.
DBDec 11, 2024
Efficient Dynamic Attributed Graph GenerationFan Li, Xiaoyang Wang, Dawei Cheng et al.
Data generation is a fundamental research problem in data management due to its diverse use cases, ranging from testing database engines to data-specific applications. However, real-world entities often involve complex interactions that cannot be effectively modeled by traditional tabular data. Therefore, graph data generation has attracted increasing attention recently. Although various graph generators have been proposed in the literature, there are three limitations: i) They cannot capture the co-evolution pattern of graph structure and node attributes. ii) Few of them consider edge direction, leading to substantial information loss. iii) Current state-of-the-art dynamic graph generators are based on the temporal random walk, making the simulation process time-consuming. To fill the research gap, we introduce VRDAG, a novel variational recurrent framework for efficient dynamic attributed graph generation. Specifically, we design a bidirectional message-passing mechanism to encode both directed structural knowledge and attribute information of a snapshot. Then, the temporal dependency in the graph sequence is captured by a recurrence state updater, generating embeddings that can preserve the evolution pattern of early graphs. Based on the hidden node embeddings, a conditional variational Bayesian method is developed to sample latent random variables at the neighboring timestep for new snapshot generation. The proposed generation paradigm avoids the time-consuming path sampling and merging process in existing random walk-based methods, significantly reducing the synthesis time. Finally, comprehensive experiments on real-world datasets are conducted to demonstrate the effectiveness and efficiency of the proposed model.
CVApr 8
TC-AE: Unlocking Token Capacity for Deep Compression AutoencodersTeng Li, Ziyuan Huang, Cong Chen et al.
We propose TC-AE, a ViT-based architecture for deep compression autoencoders. Existing methods commonly increase the channel number of latent representations to maintain reconstruction quality under high compression ratios. However, this strategy often leads to latent representation collapse, which degrades generative performance. Instead of relying on increasingly complex architectures or multi-stage training schemes, TC-AE addresses this challenge from the perspective of the token space, the key bridge between pixels and image latents, through two complementary innovations: Firstly, we study token number scaling by adjusting the patch size in ViT under a fixed latent budget, and identify aggressive token-to-latent compression as the key factor that limits effective scaling. To address this issue, we decompose token-to-latent compression into two stages, reducing structural information loss and enabling effective token number scaling for generation. Secondly, to further mitigate latent representation collapse, we enhance the semantic structure of image tokens via joint self-supervised training, leading to more generative-friendly latents. With these designs, TC-AE achieves substantially improved reconstruction and generative performance under deep compression. We hope our research will advance ViT-based tokenizer for visual generation.
CVSep 28, 2025
HieraTok: Multi-Scale Visual Tokenizer Improves Image Reconstruction and GenerationCong Chen, Ziyuan Huang, Cheng Zou et al.
In this work, we present HieraTok, a novel multi-scale Vision Transformer (ViT)-based tokenizer that overcomes the inherent limitation of modeling single-scale representations. This is realized through two key designs: (1) multi-scale downsampling applied to the token map generated by the tokenizer encoder, producing a sequence of multi-scale tokens, and (2) a scale-causal attention mechanism that enables the progressive flow of information from low-resolution global semantic features to high-resolution structural details. Coupling these designs, HieraTok achieves significant improvements in both image reconstruction and generation tasks. Under identical settings, the multi-scale visual tokenizer outperforms its single-scale counterpart by a 27.2\% improvement in rFID ($1.47 \rightarrow 1.07$). When integrated into downstream generation frameworks, it achieves a $1.38\times$ faster convergence rate and an 18.9\% boost in gFID ($16.4 \rightarrow 13.3$), which may be attributed to the smoother and more uniformly distributed latent space. Furthermore, by scaling up the tokenizer's training, we demonstrate its potential by a sota rFID of 0.45 and a gFID of 1.82 among ViT tokenizers. To the best of our knowledge, we are the first to introduce multi-scale ViT-based tokenizer in image reconstruction and image generation. We hope our findings and designs advance the ViT-based tokenizers in visual generation tasks.
AISep 28, 2025
GUI-Shepherd: Reliable Process Reward and Verification for Long-Sequence GUI TasksCong Chen, Kaixiang Ji, Hao Zhong et al.
Autonomous agents for long-sequence Graphical User Interface tasks are hindered by sparse rewards and the intractable credit assignment problem. To address these challenges, we introduce GUI-Shepherd, a Process Reward Model that provides dense, step-by-step feedback to guide agents. GUI-Shepherd is trained on a diverse large-scale data set of $52$k interactions that features human-annotated scores and GPT-4o generated rationales, enabling it to serve both as a reward provider for RL training and as a verifier for inference. As far as we know, we are the first to conduct a systematic study of process supervision in GUI agents, across diverse settings from online long-horizon tasks to offline single-step prediction. On the online AndroidWorld benchmark, GUI-Shepherd improves success rate by $7.7$ points via multi-turn online PPO, significantly outperforming Outcome Reward Model based competitors. When used as an inference verifier, it brings $5.1$ points improvements. The benefits generalize to the offline AndroidControl benchmark, with gains of $2.2$ points as a reward provider and $4.3$ points as a verifier. Collectively, our results establish that high-fidelity process supervision is critical for building more capable GUI agents and present a generalizable solution.
LGSep 19, 2025
Communications to Circulations: Real-Time 3D Wind Field Prediction Using 5G GNSS Signals and Deep LearningYuchen Ye, Chaoxia Yuan, Mingyu Li et al.
Accurate atmospheric wind field information is crucial for various applications, including weather forecasting, aviation safety, and disaster risk reduction. However, obtaining high spatiotemporal resolution wind data remains challenging due to limitations in traditional in-situ observations and remote sensing techniques, as well as the computational expense and biases of numerical weather prediction (NWP) models. This paper introduces G-WindCast, a novel deep learning framework that leverages signal strength variations from 5G Global Navigation Satellite System (GNSS) signals to forecast three-dimensional (3D) atmospheric wind fields. The framework utilizes Forward Neural Networks (FNN) and Transformer networks to capture complex, nonlinear, and spatiotemporal relationships between GNSS-derived features and wind dynamics. Our preliminary results demonstrate promising accuracy in real-time wind forecasts (up to 30 minutes lead time). The model exhibits robustness across forecast horizons and different pressure levels, and its predictions for wind fields show superior agreement with ground-based radar wind profiler compared to concurrent European Centre for Medium-Range Weather Forecasts (ECMWF) Reanalysis v5 (ERA5). Furthermore, we show that the system can maintain excellent performance for localized forecasting even with a significantly reduced number of GNSS stations (e.g., around 100), highlighting its cost-effectiveness and scalability. This interdisciplinary approach underscores the transformative potential of exploiting non-traditional data sources and deep learning for advanced environmental monitoring and real-time atmospheric applications.
IVJul 21, 2025
A Steel Surface Defect Detection Method Based on Lightweight Convolution OptimizationCong Chen, Ming Chen, Hoileong Lee et al.
Surface defect detection of steel, especially the recognition of multi-scale defects, has always been a major challenge in industrial manufacturing. Steel surfaces not only have defects of various sizes and shapes, which limit the accuracy of traditional image processing and detection methods in complex environments. However, traditional defect detection methods face issues of insufficient accuracy and high miss-detection rates when dealing with small target defects. To address this issue, this study proposes a detection framework based on deep learning, specifically YOLOv9s, combined with the C3Ghost module, SCConv module, and CARAFE upsampling operator, to improve detection accuracy and model performance. First, the SCConv module is used to reduce feature redundancy and optimize feature representation by reconstructing the spatial and channel dimensions. Second, the C3Ghost module is introduced to enhance the model's feature extraction ability by reducing redundant computations and parameter volume, thereby improving model efficiency. Finally, the CARAFE upsampling operator, which can more finely reorganize feature maps in a content-aware manner, optimizes the upsampling process and ensures detailed restoration of high-resolution defect regions. Experimental results demonstrate that the proposed model achieves higher accuracy and robustness in steel surface defect detection tasks compared to other methods, effectively addressing defect detection problems.
SPJul 21, 2025
EEG-based Epileptic Prediction via a Two-stage Channel-aware Set Transformer NetworkRuifeng Zheng, Cong Chen, Shuang Wang et al.
Epilepsy is a chronic, noncommunicable brain disorder, and sudden seizure onsets can significantly impact patients' quality of life and health. However, wearable seizure-predicting devices are still limited, partly due to the bulky size of EEG-collecting devices. To relieve the problem, we proposed a novel two-stage channel-aware Set Transformer Network that could perform seizure prediction with fewer EEG channel sensors. We also tested a seizure-independent division method which could prevent the adjacency of training and test data. Experiments were performed on the CHB-MIT dataset which includes 22 patients with 88 merged seizures. The mean sensitivity before channel selection was 76.4% with a false predicting rate (FPR) of 0.09/hour. After channel selection, dominant channels emerged in 20 out of 22 patients; the average number of channels was reduced to 2.8 from 18; and the mean sensitivity rose to 80.1% with an FPR of 0.11/hour. Furthermore, experimental results on the seizure-independent division supported our assertion that a more rigorous seizure-independent division should be used for patients with abundant EEG recordings.
AIJun 14, 2025
Behavioral Generative Agents for Energy OperationsCong Chen, Omer Karaduman, Xu Kuang
Accurately modeling consumer behavior in energy operations remains challenging due to inherent uncertainties, behavioral complexities, and limited empirical data. This paper introduces a novel approach leveraging generative agents--artificial agents powered by large language models--to realistically simulate customer decision-making in dynamic energy operations. We demonstrate that these agents behave more optimally and rationally in simpler market scenarios, while their performance becomes more variable and suboptimal as task complexity rises. Furthermore, the agents exhibit heterogeneous customer preferences, consistently maintaining distinct, persona-driven reasoning patterns. Our findings highlight the potential value of integrating generative agents into energy management simulations to improve the design and effectiveness of energy policies and incentive programs.
CVFeb 4, 2025
Uncertainty Quantification for Collaborative Object Detection Under Adversarial AttacksHuiqun Huang, Cong Chen, Jean-Philippe Monteuuis et al.
Collaborative Object Detection (COD) and collaborative perception can integrate data or features from various entities, and improve object detection accuracy compared with individual perception. However, adversarial attacks pose a potential threat to the deep learning COD models, and introduce high output uncertainty. With unknown attack models, it becomes even more challenging to improve COD resiliency and quantify the output uncertainty for highly dynamic perception scenes such as autonomous vehicles. In this study, we propose the Trusted Uncertainty Quantification in Collaborative Perception framework (TUQCP). TUQCP leverages both adversarial training and uncertainty quantification techniques to enhance the adversarial robustness of existing COD models. More specifically, TUQCP first adds perturbations to the shared information of randomly selected agents during object detection collaboration by adversarial training. TUQCP then alleviates the impacts of adversarial attacks by providing output uncertainty estimation through learning-based module and uncertainty calibration through conformal prediction. Our framework works for early and intermediate collaboration COD models and single-agent object detection models. We evaluate TUQCP on V2X-Sim, a comprehensive collaborative perception dataset for autonomous driving, and demonstrate a 80.41% improvement in object detection accuracy compared to the baselines under the same adversarial attacks. TUQCP demonstrates the importance of uncertainty quantification to COD under adversarial attacks.
CRDec 3, 2021
V2X Misbehavior and Collective Perception Service: Considerations for StandardizationMohammad Raashid Ansari, Jean-Philippe Monteuuis, Jonathan Petit et al.
Connected and Automated Vehicles use sensors and wireless communication to improve road safety and efficiency. However, attackers may target Vehicle-to-Everything communication. Indeed, an attacker may send authenticated but wrong data to send false location information, alert incorrect events or report a bogus object endangering the safety of other CAVs. Currently, Standardization Development Organizations are working on developing security standards against such attacks. Unfortunately, current standardization efforts do not include misbehavior specifications for advanced V2X services such as Collective Perception yet. This work assesses the security of Collective Perception Messages and proposes inputs for consideration in existing standards.
CVAug 15, 2021
Two Eyes Are Better Than One: Exploiting Binocular Correlation for Diabetic Retinopathy Severity GradingPeisheng Qian, Ziyuan Zhao, Cong Chen et al.
Diabetic retinopathy (DR) is one of the most common eye conditions among diabetic patients. However, vision loss occurs primarily in the late stages of DR, and the symptoms of visual impairment, ranging from mild to severe, can vary greatly, adding to the burden of diagnosis and treatment in clinical practice. Deep learning methods based on retinal images have achieved remarkable success in automatic DR grading, but most of them neglect that the presence of diabetes usually affects both eyes, and ophthalmologists usually compare both eyes concurrently for DR diagnosis, leaving correlations between left and right eyes unexploited. In this study, simulating the diagnostic process, we propose a two-stream binocular network to capture the subtle correlations between left and right eyes, in which, paired images of eyes are fed into two identical subnetworks separately during training. We design a contrastive grading loss to learn binocular correlation for five-class DR detection, which maximizes inter-class dissimilarity while minimizing the intra-class difference. Experimental results on the EyePACS dataset show the superiority of the proposed binocular model, outperforming monocular methods by a large margin.
CVJul 27, 2020
The Effect of Wearing a Mask on Face Recognition Performance: an Exploratory StudyNaser Damer, Jonas Henry Grebe, Cong Chen et al.
Face recognition has become essential in our daily lives as a convenient and contactless method of accurate identity verification. Process such as identity verification at automatic border control gates or the secure login to electronic devices are increasingly dependant on such technologies. The recent COVID-19 pandemic have increased the value of hygienic and contactless identity verification. However, the pandemic led to the wide use of face masks, essential to keep the pandemic under control. The effect of wearing a mask on face recognition in a collaborative environment is currently sensitive yet understudied issue. We address that by presenting a specifically collected database containing three session, each with three different capture instructions, to simulate realistic use cases. We further study the effect of masked face probes on the behaviour of three top-performing face recognition systems, two academic solutions and one commercial off-the-shelf (COTS) system.
LGFeb 28, 2020
HOTCAKE: Higher Order Tucker Articulated Kernels for Deeper CNN CompressionRui Lin, Ching-Yun Ko, Zhuolun He et al.
The emerging edge computing has promoted immense interests in compacting a neural network without sacrificing much accuracy. In this regard, low-rank tensor decomposition constitutes a powerful tool to compress convolutional neural networks (CNNs) by decomposing the 4-way kernel tensor into multi-stage smaller ones. Building on top of Tucker-2 decomposition, we propose a generalized Higher Order Tucker Articulated Kernels (HOTCAKE) scheme comprising four steps: input channel decomposition, guided Tucker rank selection, higher order Tucker decomposition and fine-tuning. By subjecting each CONV layer to HOTCAKE, a highly compressed CNN model with graceful accuracy trade-off is obtained. Experiments show HOTCAKE can compress even pre-compressed models and produce state-of-the-art lightweight networks.
LGJan 2, 2020
Kernelized Support Tensor Train MachinesCong Chen, Kim Batselier, Wenjian Yu et al.
Tensor, a multi-dimensional data structure, has been exploited recently in the machine learning community. Traditional machine learning approaches are vector- or matrix-based, and cannot handle tensorial data directly. In this paper, we propose a tensor train (TT)-based kernel technique for the first time, and apply it to the conventional support vector machine (SVM) for image classification. Specifically, we propose a kernelized support tensor train machine that accepts tensorial input and preserves the intrinsic kernel property. The main contributions are threefold. First, we propose a TT-based feature mapping procedure that maintains the TT structure in the feature space. Second, we demonstrate two ways to construct the TT-based kernel function while considering consistency with the TT inner product and preservation of information. Third, we show that it is possible to apply different kernel functions on different data modes. In principle, our method tensorizes the standard SVM on its input structure and kernel mapping scheme. Extensive experiments are performed on real-world tensor data, which demonstrates the superiority of the proposed scheme under few-sample high-dimensional inputs.
CVNov 12, 2018
Matrix Product Operator Restricted Boltzmann MachinesCong Chen, Kim Batselier, Ching-Yun Ko et al.
A restricted Boltzmann machine (RBM) learns a probability distribution over its input samples and has numerous uses like dimensionality reduction, classification and generative modeling. Conventional RBMs accept vectorized data that dismisses potentially important structural information in the original tensor (multi-way) input. Matrix-variate and tensor-variate RBMs, named MvRBM and TvRBM, have been proposed but are all restrictive by model construction, which leads to a weak model expression power. This work presents the matrix product operator RBM (MPORBM) that utilizes a tensor network generalization of Mv/TvRBM, preserves input formats in both the visible and hidden layers, and results in higher expressive power. A novel training algorithm integrating contrastive divergence and an alternating optimization procedure is also developed. Numerical experiments compare the MPORBM with the traditional RBM and MvRBM for data classification and image completion and denoising tasks. The expressive power of the MPORBM as a function of the MPO-rank is also investigated.
LGNov 9, 2018
Deep Compression of Sum-Product Networks on Tensor NetworksChing-Yun Ko, Cong Chen, Yuke Zhang et al.
Sum-product networks (SPNs) represent an emerging class of neural networks with clear probabilistic semantics and superior inference speed over graphical models. This work reveals a strikingly intimate connection between SPNs and tensor networks, thus leading to a highly efficient representation that we call tensor SPNs (tSPNs). For the first time, through mapping an SPN onto a tSPN and employing novel optimization techniques, we demonstrate remarkable parameter compression with negligible loss in accuracy.
LGApr 17, 2018
A Support Tensor Train MachineCong Chen, Kim Batselier, Ching-Yun Ko et al.
There has been growing interest in extending traditional vector-based machine learning techniques to their tensor forms. An example is the support tensor machine (STM) that utilizes a rank-one tensor to capture the data structure, thereby alleviating the overfitting and curse of dimensionality problems in the conventional support vector machine (SVM). However, the expressive power of a rank-one tensor is restrictive for many real-world data. To overcome this limitation, we introduce a support tensor train machine (STTM) by replacing the rank-one tensor in an STM with a tensor train. Experiments validate and confirm the superiority of an STTM over the SVM and STM.