Jian Gao

CV
h-index24
45papers
696citations
Novelty48%
AI Score57

45 Papers

CVJul 16, 2024Code
Sliding Gaussian ball adaptive growth (SlingBAG): point cloud-based iterative algorithm for large-scale 3D photoacoustic imaging

Shuang Li, Yibing Wang, Jian Gao et al.

Large-scale 3D photoacoustic (PA) imaging has become increasingly important for both clinical and pre-clinical applications. Limited by cost and system complexity, only systems with sparsely-distributed sensors can be widely implemented, which desires advanced reconstruction algorithms to reduce artifacts. However, high computing memory and time consumption of traditional iterative reconstruction (IR) algorithms is practically unacceptable for large-scale 3D PA imaging. Here, we propose a point cloud-based IR algorithm that reduces memory consumption by several orders, wherein the 3D PA scene is modeled as a series of Gaussian-distributed spherical sources stored in form of point cloud. During the IR process, not only are properties of each Gaussian source, including its peak intensity (initial pressure value), standard deviation (size) and mean (position) continuously optimized, but also each Gaussian source itself adaptively undergoes destroying, splitting, and duplication along the gradient direction. This method, named the sliding Gaussian ball adaptive growth (SlingBAG) algorithm, enables high-quality large-scale 3D PA reconstruction with fast iteration and extremely low memory usage. We validated SlingBAG algorithm in both simulation study and in vivo animal experiments. The source code and data for SlingBAG, along with supplementary materials and demonstration videos, are now available in the following GitHub repository: https://github.com/JaegerCQ/SlingBAG.

CVMar 23Code
PA-SFM: Tracker-free differentiable acoustic radiation for freehand 3D photoacoustic imaging

Shuang Li, Jian Gao, Chulhong Kim et al.

Three-dimensional (3D) handheld photoacoustic tomography typically relies on bulky and expensive external positioning sensors to correct motion artifacts, which severely limits its clinical flexibility and accessibility. To address this challenge, we present PA-SFM, a tracker-free framework that leverages exclusively single-modality photoacoustic data for both sensor pose recovery and high-fidelity 3D reconstruction via differentiable acoustic radiation modeling. Unlike traditional structure-from-motion (SFM) methods based on visual features, PA-SFM integrates the acoustic wave equation into a differentiable programming pipeline. By leveraging a high-performance, GPU-accelerated acoustic radiation kernel, the framework simultaneously optimizes the 3D photoacoustic source distribution and the sensor array pose via gradient descent. To ensure robust convergence in freehand scenarios, we introduce a coarse-to-fine optimization strategy that incorporates geometric consistency checks and rigid-body constraints to eliminate motion outliers. We validated the proposed method through both numerical simulations and in-vivo rat experiments. The results demonstrate that PA-SFM achieves sub-millimeter positioning accuracy and restores high-resolution 3D vascular structures comparable to ground-truth benchmarks, offering a low-cost, software-defined solution for clinical freehand photoacoustic imaging. The source code is publicly available at \href{https://github.com/JaegerCQ/PA-SFM}{https://github.com/JaegerCQ/PA-SFM}.

ARDec 27, 2025Code
AnalogSAGE: Self-evolving Analog Design Multi-Agents with Stratified Memory and Grounded Experience

Zining Wang, Jian Gao, Weimin Fu et al.

Analog circuit design remains a knowledge- and experience-intensive process that relies heavily on human intuition for topology generation and device parameter tuning. Existing LLM-based approaches typically depend on prompt-driven netlist generation or predefined topology templates, limiting their ability to satisfy complex specification requirements. We propose AnalogSAGE, an open-source self-evolving multi-agent framework that coordinates three-stage agent explorations through four stratified memory layers, enabling iterative refinement with simulation-grounded feedback. To support reproducibility and generality, we release the source code. Our benchmark spans ten specification-driven operational amplifier design problems of varying difficulty, enabling quantitative and cross-task comparison under identical conditions. Evaluated under the open-source SKY130 PDK with ngspice, AnalogSAGE achieves a 10$\times$ overall pass rate, a 48$\times$ Pass@1, and a 4$\times$ reduction in parameter search space compared with existing frameworks, demonstrating that stratified memory and grounded reasoning substantially enhance the reliability and autonomy of analog design automation in practice.

CVApr 16
Geometric Context Transformer for Streaming 3D Reconstruction

Lin-Zhuo Chen, Jian Gao, Yihang Chen et al.

Streaming 3D reconstruction aims to recover 3D information, such as camera poses and point clouds, from a video stream, which necessitates geometric accuracy, temporal consistency, and computational efficiency. Motivated by the principles of Simultaneous Localization and Mapping (SLAM), we introduce LingBot-Map, a feed-forward 3D foundation model for reconstructing scenes from streaming data, built upon a geometric context transformer (GCT) architecture. A defining aspect of LingBot-Map lies in its carefully designed attention mechanism, which integrates an anchor context, a pose-reference window, and a trajectory memory to address coordinate grounding, dense geometric cues, and long-range drift correction, respectively. This design keeps the streaming state compact while retaining rich geometric context, enabling stable efficient inference at around 20 FPS on 518 x 378 resolution inputs over long sequences exceeding 10,000 frames. Extensive evaluations across a variety of benchmarks demonstrate that our approach achieves superior performance compared to both existing streaming and iterative optimization-based approaches.

CVJan 2Code
SlingBAG Pro: Accelerating point cloud-based iterative reconstruction for 3D photoacoustic imaging with arbitrary array geometries

Shuang Li, Yibing Wang, Jian Gao et al.

High-quality three-dimensional (3D) photoacoustic imaging (PAI) is gaining increasing attention in clinical applications. To address the challenges of limited space and high costs, irregular geometric transducer arrays that conform to specific imaging regions are promising for achieving high-quality 3D PAI with fewer transducers. However, traditional iterative reconstruction algorithms struggle with irregular array configurations, suffering from high computational complexity, substantial memory requirements, and lengthy reconstruction times. In this work, we introduce SlingBAG Pro, an advanced reconstruction algorithm based on the point cloud iteration concept of the Sliding ball adaptive growth (SlingBAG) method, while extending its compatibility to arbitrary array geometries. SlingBAG Pro maintains high reconstruction quality, reduces the number of required transducers, and employs a hierarchical optimization strategy that combines zero-gradient filtering with progressively increased temporal sampling rates during iteration. This strategy rapidly removes redundant spatial point clouds, accelerates convergence, and significantly shortens overall reconstruction time. Compared to the original SlingBAG algorithm, SlingBAG Pro achieves up to a 2.2-fold speed improvement in point cloud-based 3D PA reconstruction under irregular array geometries. The proposed method is validated through both simulation and in vivo mouse experiments, and the source code is publicly available at https://github.com/JaegerCQ/SlingBAG_Pro.

DLApr 17, 2023
Quantifying the Benefit of Artificial Intelligence for Scientific Research

Jian Gao, Dashun Wang

The ongoing artificial intelligence (AI) revolution has the potential to change almost every line of work. As AI capabilities continue to improve in accuracy, robustness, and reach, AI may outperform and even replace human experts across many valuable tasks. Despite enormous effort devoted to understanding the impact of AI on labor and the economy and AI's recent successes in accelerating scientific discovery and progress, we lack a systematic understanding of how AI advances may benefit scientific research across disciplines and fields. Here, drawing from the literature on the future of work and the science of science, we develop a measurement framework to estimate both the direct use of AI and the potential benefit of AI in scientific research, applying natural language processing techniques to 74.6 million publications and 7.1 million patents. We find that the use of AI in research is widespread throughout the sciences, growing especially rapidly since 2015, and papers that use AI exhibit a citation premium, more likely to be highly cited both within and outside their disciplines. Moreover, our analysis reveals considerable potential for AI to benefit numerous scientific fields, yet a notable disconnect exists between AI education and its research applications, highlighting a mismatch between the supply of AI expertise and its demand in research. Lastly, we examine demographic disparities in AI's benefits across scientific disciplines and find that disciplines with a higher proportion of women or Black scientists tend to be associated with less benefit, suggesting that AI's growing impact on research may further exacerbate existing inequalities in science. As the connection between AI and scientific research deepens, our findings may become increasingly important, with implications for the equity and sustainability of the research enterprise.

CVNov 27, 2023
Relightable 3D Gaussians: Realistic Point Cloud Relighting with BRDF Decomposition and Ray Tracing

Jian Gao, Chun Gu, Youtian Lin et al.

In this paper, we present a novel differentiable point-based rendering framework to achieve photo-realistic relighting. To make the reconstructed scene relightable, we enhance vanilla 3D Gaussians by associating extra properties, including normal vectors, BRDF parameters, and incident lighting from various directions. From a collection of multi-view images, the 3D scene is optimized through 3D Gaussian Splatting while BRDF and lighting are decomposed by physically based differentiable rendering. To produce plausible shadow effects in photo-realistic relighting, we introduce an innovative point-based ray tracing with the bounding volume hierarchies for efficient visibility pre-computation. Extensive experiments demonstrate our improved BRDF estimation, novel view synthesis and relighting results compared to state-of-the-art approaches. The proposed framework showcases the potential to revolutionize the mesh-based graphics pipeline with a point-based pipeline enabling editing, tracing, and relighting.

CLNov 14, 2025Code
LaoBench: A Large-Scale Multidimensional Lao Benchmark for Large Language Models

Jian Gao, Richeng Xuan, Zhaolu Kang et al.

The rapid advancement of large language models (LLMs) has not been matched by their evaluation in low-resource languages, especially Southeast Asian languages like Lao. To fill this gap, we introduce LaoBench, the first large-scale, high-quality, and multidimensional benchmark dataset dedicated to assessing LLMs' comprehensive language understanding and reasoning abilities in Lao. LaoBench comprises over 17,000 carefully curated samples spanning three core dimensions: knowledge application, K12 foundational education, and bilingual translation among Lao, Chinese, and English. The dataset is divided into open-source and closed-source subsets, with the closed-source portion enabling black-box evaluation on an official platform to ensure fairness and data security. Our data construction pipeline integrates expert human curation with automated agent-assisted verification, ensuring linguistic accuracy, cultural relevance, and educational value. Benchmarking multiple state-of-the-art LLMs on LaoBench reveals that current models still face significant challenges in mastering Lao across diverse tasks. We hope LaoBench will catalyze further research and development of AI technologies for underrepresented Southeast Asian languages.

SOC-PHDec 24, 2017
Maximizing the Collective Learning Effects in Regional Economic Development

Jian Gao

Collective learning in economic development has been revealed by recent empirical studies, however, investigations on how to benefit most from its effects remain still lacking. In this paper, we explore the maximization of the collective learning effects using a simple propagation model to study the diversification of industries on real networks built on Brazilian labor data. For the inter-regional learning, we find an optimal strategy that makes a balance between core and periphery industries in the initial activation, considering the core-periphery structure of the industry space--a network representation of the relatedness between industries. For the inter-regional learning, we find an optimal strategy that makes a balance between nearby and distant regions in establishing new spatial connections, considering the spatial structure of the integrated adjacent network that connects all regions. Our findings suggest that the near to by random strategies are likely to make the best use of the collective learning effects in advancing regional economic development practices.

AIAug 23, 2024
Taming Text-to-Image Synthesis for Novices: User-centric Prompt Generation via Multi-turn Guidance

Yilun Liu, Minggui He, Feiyu Yao et al.

The emergence of text-to-image synthesis (TIS) models has significantly influenced digital image creation by producing high-quality visuals from written descriptions. Yet these models are sensitive on textual prompts, posing a challenge for novice users who may not be familiar with TIS prompt writing. Existing solutions relieve this via automatic prompt expansion or generation from a user query. However, this single-turn manner suffers from limited user-centricity in terms of result interpretability and user interactivity. Thus, we propose DialPrompt, a dialogue-based TIS prompt generation model that emphasizes user experience for novice users. DialPrompt is designed to follow a multi-turn workflow, where in each round of dialogue the model guides user to express their preferences on possible optimization dimensions before generating the final TIS prompt. To achieve this, we mined 15 essential dimensions for high-quality prompts from advanced users and curated a multi-turn dataset. Through training on this dataset, DialPrompt improves user-centricity by allowing users to perceive and control the creation process of TIS prompts. Experiments indicate that DialPrompt improves significantly in user-centricity score compared with existing approaches while maintaining a competitive quality of synthesized images. In our user evaluation, DialPrompt is highly rated by 19 human reviewers (especially novices).

CLAug 9, 2024Code
reCSE: Portable Reshaping Features for Sentence Embedding in Self-supervised Contrastive Learning

Fufangchen Zhao, Jian Gao, Danfeng Yan

We propose reCSE, a self supervised contrastive learning sentence representation framework based on feature reshaping. This framework is different from the current advanced models that use discrete data augmentation methods, but instead reshapes the input features of the original sentence, aggregates the global information of each token in the sentence, and alleviates the common problems of representation polarity and GPU memory consumption linear increase in current advanced models. In addition, our reCSE has achieved competitive performance in semantic similarity tasks. And the experiment proves that our proposed feature reshaping method has strong universality, which can be transplanted to other self supervised contrastive learning frameworks and enhance their representation ability, even achieving state-of-the-art performance. Our code is available at https://github.com/heavenhellchen/reCSE.

LGFeb 28, 2025Code
AnalogGenie: A Generative Engine for Automatic Discovery of Analog Circuit Topologies

Jian Gao, Weidong Cao, Junyi Yang et al. · tsinghua

The massive and large-scale design of foundational semiconductor integrated circuits (ICs) is crucial to sustaining the advancement of many emerging and future technologies, such as generative AI, 5G/6G, and quantum computing. Excitingly, recent studies have shown the great capabilities of foundational models in expediting the design of digital ICs. Yet, applying generative AI techniques to accelerate the design of analog ICs remains a significant challenge due to critical domain-specific issues, such as the lack of a comprehensive dataset and effective representation methods for analog circuits. This paper proposes, $\textbf{AnalogGenie}$, a $\underline{\textbf{Gen}}$erat$\underline{\textbf{i}}$ve $\underline{\textbf{e}}$ngine for automatic design/discovery of $\underline{\textbf{Analog}}$ circuit topologies--the most challenging and creative task in the conventional manual design flow of analog ICs. AnalogGenie addresses two key gaps in the field: building a foundational comprehensive dataset of analog circuit topology and developing a scalable sequence-based graph representation universal to analog circuits. Experimental results show the remarkable generation performance of AnalogGenie in broadening the variety of analog ICs, increasing the number of devices within a single design, and discovering unseen circuit topologies far beyond any prior arts. Our work paves the way to transform the longstanding time-consuming manual design flow of analog ICs to an automatic and massive manner powered by generative AI. Our source code is available at https://github.com/xz-group/AnalogGenie.

ASSep 18, 2023
HTEC: Human Transcription Error Correction

Hanbo Sun, Jian Gao, Xiaomin Wu et al.

High-quality human transcription is essential for training and improving Automatic Speech Recognition (ASR) models. Recent study~\cite{libricrowd} has found that every 1% worse transcription Word Error Rate (WER) increases approximately 2% ASR WER by using the transcriptions to train ASR models. Transcription errors are inevitable for even highly-trained annotators. However, few studies have explored human transcription correction. Error correction methods for other problems, such as ASR error correction and grammatical error correction, do not perform sufficiently for this problem. Therefore, we propose HTEC for Human Transcription Error Correction. HTEC consists of two stages: Trans-Checker, an error detection model that predicts and masks erroneous words, and Trans-Filler, a sequence-to-sequence generative model that fills masked positions. We propose a holistic list of correction operations, including four novel operations handling deletion errors. We further propose a variant of embeddings that incorporates phoneme information into the input of the transformer. HTEC outperforms other methods by a large margin and surpasses human annotators by 2.2% to 4.5% in WER. Finally, we deployed HTEC to assist human annotators and showed HTEC is particularly effective as a co-pilot, which improves transcription quality by 15.1% without sacrificing transcription velocity.

CLSep 24, 2023
Human Transcription Quality Improvement

Jian Gao, Hanbo Sun, Cheng Cao et al.

High quality transcription data is crucial for training automatic speech recognition (ASR) systems. However, the existing industry-level data collection pipelines are expensive to researchers, while the quality of crowdsourced transcription is low. In this paper, we propose a reliable method to collect speech transcriptions. We introduce two mechanisms to improve transcription quality: confidence estimation based reprocessing at labeling stage, and automatic word error correction at post-labeling stage. We collect and release LibriCrowd - a large-scale crowdsourced dataset of audio transcriptions on 100 hours of English speech. Experiment shows the Transcription WER is reduced by over 50%. We further investigate the impact of transcription error on ASR model performance and found a strong correlation. The transcription quality improvement provides over 10% relative WER reduction for ASR models. We release the dataset and code to benefit the research community.

AISep 19, 2023
Human-AI Interactions and Societal Pitfalls

Francisco Castro, Jian Gao, Sébastien Martin

When working with generative artificial intelligence (AI), users may see productivity gains, but the AI-generated content may not match their preferences exactly. To study this effect, we introduce a Bayesian framework in which heterogeneous users choose how much information to share with the AI, facing a trade-off between output fidelity and communication cost. We show that the interplay between these individual-level decisions and AI training may lead to societal challenges. Outputs may become more homogenized, especially when the AI is trained on AI-generated content, potentially triggering a homogenization death spiral. And any AI bias may propagate to become societal bias. A solution to the homogenization and bias issues is to reduce human-AI interaction frictions and enable users to flexibly share information, leading to personalized outputs without sacrificing productivity.

CLJan 23
Dynamic Role Assignment for Multi-Agent Debate

Miao Zhang, Junsik Kim, Siyuan Xiang et al.

Multi-agent large language model (LLM) and vision-language model (VLM) debate systems employ specialized roles for complex problem-solving, yet model specializations are not leveraged to decide which model should fill which role. We propose dynamic role assignment, a framework that runs a Meta-Debate to select suitable agents before the actual debate. The meta-debate has two stages: (1) proposal, where candidates provide role-tailored arguments, and (2) peer review, where proposals are scored with data and role-specific criteria to choose the best agent for each position. We evaluate our method on LLM problem solving benchmarks. Applied on top of existing debate systems, our approach consistently outperforms uniform assignments (filling all roles with the same model) by up to 74.8% and random assignments (assigning models to roles without considering their suitability) by up to 29.7%, depending on the task and the specific assignment. This work establishes a new paradigm for multi-agent system design, shifting from static agent deployment to dynamic and capability-aware selection.

CVApr 9Code
Lost in the Hype: Revealing and Dissecting the Performance Degradation of Medical Multimodal Large Language Models in Image Classification

Xun Zhu, Fanbin Mo, Xi Chen et al.

The rise of multimodal large language models (MLLMs) has sparked an unprecedented wave of applications in the field of medical imaging analysis. However, as one of the earliest and most fundamental tasks integrated into this paradigm, medical image classification reveals a sobering reality: state-of-the-art medical MLLMs consistently underperform compared to traditional deep learning models, despite their overwhelming advantages in pre-training data and model parameters. This paradox prompts a critical rethinking: where exactly does the performance degradation originate? In this paper, we conduct extensive experiments on 14 open-source medical MLLMs across three representative image classification datasets. Moving beyond superficial performance benchmarking, we employ feature probing to track the information flow of visual features module-by-module and layer-by-layer throughout the entire MLLM pipeline, enabling explicit visualization of where and how classification signals are distorted, diluted, or overridden. As the first attempt to dissect classification performance degradation in medical MLLMs, our findings reveal four failure modes: 1) quality limitation in visual representation, 2) fidelity loss in connector projection, 3) comprehension deficit in LLM reasoning, and 4) misalignment of semantic mapping. Meanwhile, we introduce quantitative scores that characterize the healthiness of feature evolution, enabling principled comparisons across diverse MLLMs and datasets. Furthermore, we provide insightful discussions centered on the critical barriers that prevent current medical MLLMs from fulfilling their promised clinical potential. We hope that our work provokes rethinking within the community-highlighting that the road from high expectations to clinically deployable MLLMs remains long and winding.

AINov 11, 2023
An Intelligent Social Learning-based Optimization Strategy for Black-box Robotic Control with Reinforcement Learning

Xubo Yang, Jian Gao, Ting Wang et al.

Implementing intelligent control of robots is a difficult task, especially when dealing with complex black-box systems, because of the lack of visibility and understanding of how these robots work internally. This paper proposes an Intelligent Social Learning (ISL) algorithm to enable intelligent control of black-box robotic systems. Inspired by mutual learning among individuals in human social groups, ISL includes learning, imitation, and self-study styles. Individuals in the learning style use the Levy flight search strategy to learn from the best performer and form the closest relationships. In the imitation style, individuals mimic the best performer with a second-level rapport by employing a random perturbation strategy. In the self-study style, individuals learn independently using a normal distribution sampling method while maintaining a distant relationship with the best performer. Individuals in the population are regarded as autonomous intelligent agents in each style. Neural networks perform strategic actions in three styles to interact with the environment and the robot and iteratively optimize the network policy. Overall, ISL builds on the principles of intelligent optimization, incorporating ideas from reinforcement learning, and possesses strong search capabilities, fast computation speed, fewer hyperparameters, and insensitivity to sparse rewards. The proposed ISL algorithm is compared with four state-of-the-art methods on six continuous control benchmark cases in MuJoCo to verify its effectiveness and advantages. Furthermore, ISL is adopted in the simulation and experimental grasping tasks of the UR3 robot for validations, and satisfactory solutions are yielded.

LGNov 10, 2025
ZeroSim: Zero-Shot Analog Circuit Evaluation with Unified Transformer Embeddings

Xiaomeng Yang, Jian Gao, Yanzhi Wang et al.

Although recent advancements in learning-based analog circuit design automation have tackled tasks such as topology generation, device sizing, and layout synthesis, efficient performance evaluation remains a major bottleneck. Traditional SPICE simulations are time-consuming, while existing machine learning methods often require topology-specific retraining or manual substructure segmentation for fine-tuning, hindering scalability and adaptability. In this work, we propose ZeroSim, a transformer-based performance modeling framework designed to achieve robust in-distribution generalization across trained topologies under novel parameter configurations and zero-shot generalization to unseen topologies without any fine-tuning. We apply three key enabling strategies: (1) a diverse training corpus of 3.6 million instances covering over 60 amplifier topologies, (2) unified topology embeddings leveraging global-aware tokens and hierarchical attention to robustly generalize to novel circuits, and (3) a topology-conditioned parameter mapping approach that maintains consistent structural representations independent of parameter variations. Our experimental results demonstrate that ZeroSim significantly outperforms baseline models such as multilayer perceptrons, graph neural networks and transformers, delivering accurate zero-shot predictions across different amplifier topologies. Additionally, when integrated into a reinforcement learning-based parameter optimization pipeline, ZeroSim achieves a remarkable speedup (13x) compared to conventional SPICE simulations, underscoring its practical value for a wide range of analog circuit design automation tasks.

CVSep 26, 2025Code
CircuitSense: A Hierarchical Circuit System Benchmark Bridging Visual Comprehension and Symbolic Reasoning in Engineering Design Process

Arman Akbari, Jian Gao, Yifei Zou et al.

Engineering design operates through hierarchical abstraction from system specifications to component implementations, requiring visual understanding coupled with mathematical reasoning at each level. While Multi-modal Large Language Models (MLLMs) excel at natural image tasks, their ability to extract mathematical models from technical diagrams remains unexplored. We present \textbf{CircuitSense}, a comprehensive benchmark evaluating circuit understanding across this hierarchy through 8,006+ problems spanning component-level schematics to system-level block diagrams. Our benchmark uniquely examines the complete engineering workflow: Perception, Analysis, and Design, with a particular emphasis on the critical but underexplored capability of deriving symbolic equations from visual inputs. We introduce a hierarchical synthetic generation pipeline consisting of a grid-based schematic generator and a block diagram generator with auto-derived symbolic equation labels. Comprehensive evaluation of six state-of-the-art MLLMs, including both closed-source and open-source models, reveals fundamental limitations in visual-to-mathematical reasoning. Closed-source models achieve over 85\% accuracy on perception tasks involving component recognition and topology identification, yet their performance on symbolic derivation and analytical reasoning falls below 19\%, exposing a critical gap between visual parsing and symbolic reasoning. Models with stronger symbolic reasoning capabilities consistently achieve higher design task accuracy, confirming the fundamental role of mathematical understanding in circuit synthesis and establishing symbolic reasoning as the key metric for engineering competence.

CVDec 5, 2024Code
4D SlingBAG: spatial-temporal coupled Gaussian ball for large-scale dynamic 3D photoacoustic iterative reconstruction

Shuang Li, Yibing Wang, Jian Gao et al.

Large-scale dynamic three-dimensional (3D) photoacoustic imaging (PAI) is significantly important in clinical applications. In practical implementations, large-scale 3D real-time PAI systems typically utilize sparse two-dimensional (2D) sensor arrays with certain angular deficiencies, necessitating advanced iterative reconstruction (IR) algorithms to achieve quantitative PAI and reduce reconstruction artifacts. However, for existing IR algorithms, multi-frame 3D reconstruction leads to extremely high memory consumption and prolonged computation time, with limited consideration of the spatial-temporal continuity between data frames. Here, we propose a novel method, named the 4D sliding Gaussian ball adaptive growth (4D SlingBAG) algorithm, based on the current point cloud-based IR algorithm sliding Gaussian ball adaptive growth (SlingBAG), which has minimal memory consumption among IR methods. Our 4D SlingBAG method applies spatial-temporal coupled deformation functions to each Gaussian sphere in point cloud, thus explicitly learning the deformations features of the dynamic 3D PA scene. This allows for the efficient representation of various physiological processes (such as pulsation) or external pressures (e.g., blood perfusion experiments) contributing to changes in vessel morphology and blood flow during dynamic 3D PAI, enabling highly efficient IR for dynamic 3D PAI. Simulation experiments demonstrate that 4D SlingBAG achieves high-quality dynamic 3D PA reconstruction. Compared to performing reconstructions by using SlingBAG algorithm individually for each frame, our method significantly reduces computational time and keeps a extremely low memory consumption. The project for 4D SlingBAG can be found in the following GitHub repository: \href{https://github.com/JaegerCQ/4D-SlingBAG}{https://github.com/JaegerCQ/4D-SlingBAG}.

IVSep 23, 2021Code
Rational Polynomial Camera Model Warping for Deep Learning Based Satellite Multi-View Stereo Matching

Jian Gao, Jin Liu, Shunping Ji

Satellite multi-view stereo (MVS) imagery is particularly suited for large-scale Earth surface reconstruction. Differing from the perspective camera model (pin-hole model) that is commonly used for close-range and aerial cameras, the cubic rational polynomial camera (RPC) model is the mainstream model for push-broom linear-array satellite cameras. However, the homography warping used in the prevailing learning based MVS methods is only applicable to pin-hole cameras. In order to apply the SOTA learning based MVS technology to the satellite MVS task for large-scale Earth surface reconstruction, RPC warping should be considered. In this work, we propose, for the first time, a rigorous RPC warping module. The rational polynomial coefficients are recorded as a tensor, and the RPC warping is formulated as a series of tensor transformations. Based on the RPC warping, we propose the deep learning based satellite MVS (SatMVS) framework for large-scale and wide depth range Earth surface reconstruction. We also introduce a large-scale satellite image dataset consisting of 519 5120${\times}$5120 images, which we call the TLC SatMVS dataset. The satellite images were acquired from a three-line camera (TLC) that catches triple-view images simultaneously, forming a valuable supplement to the existing open-source WorldView-3 datasets with single-scanline images. Experiments show that the proposed RPC warping module and the SatMVS framework can achieve a superior reconstruction accuracy compared to the pin-hole fitting method and conventional MVS methods. Code and data are available at https://github.com/WHU-GPCV/SatMVS.

AISep 26, 2024
Enhancing elusive clues in knowledge learning by contrasting attention of language models

Jian Gao, Xiao Zhang, Ji Wu et al.

Causal language models acquire vast amount of knowledge from general text corpus during pretraining, but the efficiency of knowledge learning is known to be unsatisfactory, especially when learning from knowledge-dense and small-sized corpora. The deficiency can come from long-distance dependencies which are hard to capture by language models, and overfitting to co-occurrence patterns and distracting clues in the training text. To address these issues, the paper proposes a method to enhance knowledge learning during language model pretraining, by enhancing elusive but important clues in text discovered by the language model themselves. We found that larger language models pay more attention to non-obvious but important clues, which are often overlooked by smaller language models. Therefore, we can identify these clues by contrasting the attention weights of large and small language models. We use the identified clues as a guide to perform token-dropout data augmentation on the training text, and observed a significant boost in both small and large models' performance in fact memorization. This shows that the behavior contrast between more and less-performant language models contains important clues for knowledge learning, and it can be ``amplified" for a straight-forward improvement in knowledge learning efficiency.

LGJan 29
PowerGenie: Analytically-Guided Evolutionary Discovery of Superior Reconfigurable Power Converters

Jian Gao, Yiwei Zou, Abhishek Pradhan et al.

Discovering superior circuit topologies requires navigating an exponentially large design space-a challenge traditionally reserved for human experts. Existing AI methods either select from predefined templates or generate novel topologies at a limited scale without rigorous verification, leaving large-scale performance-driven discovery underexplored. We present PowerGenie, a framework for automated discovery of higher-performance reconfigurable power converters at scale. PowerGenie introduces: (1) an automated analytical framework that determines converter functionality and theoretical performance limits without component sizing or SPICE simulation, and (2) an evolutionary finetuning method that co-evolves a generative model with its training distribution through fitness selection and uniqueness verification. Unlike existing methods that suffer from mode collapse and overfitting, our approach achieves higher syntax validity, function validity, novelty rate, and figure-of-merit (FoM). PowerGenie discovers a novel 8-mode reconfigurable converter with 23% higher FoM than the best training topology. SPICE simulations confirm average absolute efficiency gains of 10% across 8 modes and up to 17% at a single mode. Code will be released upon publication.

IVDec 21, 2023
Hunting imaging biomarkers in pulmonary fibrosis: Benchmarks of the AIIB23 challenge

Yang Nan, Xiaodan Xing, Shiyi Wang et al.

Airway-related quantitative imaging biomarkers are crucial for examination, diagnosis, and prognosis in pulmonary diseases. However, the manual delineation of airway trees remains prohibitively time-consuming. While significant efforts have been made towards enhancing airway modelling, current public-available datasets concentrate on lung diseases with moderate morphological variations. The intricate honeycombing patterns present in the lung tissues of fibrotic lung disease patients exacerbate the challenges, often leading to various prediction errors. To address this issue, the 'Airway-Informed Quantitative CT Imaging Biomarker for Fibrotic Lung Disease 2023' (AIIB23) competition was organized in conjunction with the official 2023 International Conference on Medical Image Computing and Computer Assisted Intervention (MICCAI). The airway structures were meticulously annotated by three experienced radiologists. Competitors were encouraged to develop automatic airway segmentation models with high robustness and generalization abilities, followed by exploring the most correlated QIB of mortality prediction. A training set of 120 high-resolution computerised tomography (HRCT) scans were publicly released with expert annotations and mortality status. The online validation set incorporated 52 HRCT scans from patients with fibrotic lung disease and the offline test set included 140 cases from fibrosis and COVID-19 patients. The results have shown that the capacity of extracting airway trees from patients with fibrotic lung disease could be enhanced by introducing voxel-wise weighted general union loss and continuity loss. In addition to the competitive image biomarkers for prognosis, a strong airway-derived biomarker (Hazard ratio>1.5, p<0.0001) was revealed for survival prognostication compared with existing clinical measurements, clinician assessment and AI-based biomarkers.

CVApr 22
Random Walk on Point Clouds for Feature Detection

Yuhe Zhang, Zhikun Tu, Zhi Li et al.

The points on the point clouds that can entirely outline the shape of the model are of critical importance, as they serve as the foundation for numerous point cloud processing tasks and are widely utilized in computer graphics and computer-aided design. This study introduces a novel method, RWoDSN, for extracting such feature points, incorporating considerations of sharp-to-smooth transitions, large-to-small scales, and textural-to-detailed features. We approach feature extraction as a two-stage context-dependent analysis problem. In the first stage, we propose a novel neighborhood descriptor, termed the Disk Sampling Neighborhood (DSN), which, unlike traditional spatially and geometrically invariant approaches, preserves a matrix structure while maintaining normal neighborhood relationships. In the second stage, a random walk is performed on the DSN (RWoDSN), yielding a graph-based DSN that simultaneously accounts for the spatial distribution, topological properties, and geometric characteristics of the local surface surrounding each point. This enables the effective extraction of feature points. Experimental results demonstrate that the proposed RWoDSN method achieves a recall of 0.769-22% higher than the current state-of-the-art-alongside a precision of 0.784. Furthermore, it significantly outperforms several traditional and deep-learning techniques across eight evaluation metrics.

CVMar 12, 2025
FaVChat: Unlocking Fine-Grained Facial Video Understanding with Multimodal Large Language Models

Fufangchen Zhao, Ming Li, Linrui Xu et al.

Video-based multimodal large language models (VMLLMs) have demonstrated remarkable potential in cross-modal video understanding. However, their abilities in fine-grained face comprehension remain largely underexplored. Given its pivotal role in human-centric intelligence, developing VMLLMs for facial understanding holds a fundamental problem. To address this gap, we propose FaVChat, the first VMLLM specifically designed for fine-grained facial video understanding. To facilitate its training, we construct a large-scale facial video dataset comprising over 60k videos, with the majority annotated with 83 fine-grained facial attributes. These attributes are incorporated to enrich GPT-4o-generated captions, yielding 60k high-quality video-summary pairs and an additional 170k fine-grained question-answering (QA) pairs. To effectively capture rich facial clues, we propose a hybrid model architecture composed of a general visual encoder, a dedicated facial encoder, and a mixture-of-experts-enhanced adapter for adaptive fusion of multi-source visual features. To mitigate information loss during feature transformation, we extract multi-granularity representations from the facial encoder and integrate them into the subsequent LLM. This design enhances the model's ability to comprehend and respond to questions involving diverse levels of visual details. We employ a progressive training paradigm, transitioning from video summarization to a high-quality subset of video QA, gradually increasing task complexity to enhance the model's fine-grained visual perception. We conduct extensive zero-shot evaluation on a couple of public benchmarks, demonstrating that FaVChat consistently surpasses existing VMLLMs across multiple tasks.

CVSep 11, 2025
SpatialVID: A Large-Scale Video Dataset with Spatial Annotations

Jiahao Wang, Yufeng Yuan, Rujie Zheng et al.

Significant progress has been made in spatial intelligence, spanning both spatial reconstruction and world exploration. However, the scalability and real-world fidelity of current models remain severely constrained by the scarcity of large-scale, high-quality training data. While several datasets provide camera pose information, they are typically limited in scale, diversity, and annotation richness, particularly for real-world dynamic scenes with ground-truth camera motion. To this end, we collect \textbf{SpatialVID}, a dataset consists of a large corpus of in-the-wild videos with diverse scenes, camera movements and dense 3D annotations such as per-frame camera poses, depth, and motion instructions. Specifically, we collect more than 21,000 hours of raw video, and process them into 2.7 million clips through a hierarchical filtering pipeline, totaling 7,089 hours of dynamic content. A subsequent annotation pipeline enriches these clips with detailed spatial and semantic information, including camera poses, depth maps, dynamic masks, structured captions, and serialized motion instructions. Analysis of SpatialVID's data statistics reveals a richness and diversity that directly foster improved model generalization and performance, establishing it as a key asset for the video and 3D vision research community.

LGJul 20, 2025
AnalogFed: Federated Discovery of Analog Circuit Topologies with Generative AI

Qiufeng Li, Shu Hong, Jian Gao et al. · tsinghua

Recent breakthroughs in AI/ML offer exciting opportunities to revolutionize analog design automation through data-driven approaches. In particular, researchers are increasingly fascinated by harnessing the power of generative AI to automate the discovery of novel analog circuit topologies. Unlocking the full potential of generative AI in these data-driven discoveries requires access to large and diverse datasets.Yet, there is a significant barrier in the analog domain--Analog circuit design is inherently proprietary, involving not only confidential circuit structures but also the underlying commercial semiconductor processes. As a result, current generative AI research is largely confined to individual researchers who construct small, narrowly focused private datasets. This fragmentation severely limits collaborative innovation and impedes progress across the research community. To address these challenges, we propose AnalogFed. AnalogFed enables collaborative topology discovery across decentralized clients (e.g., individual researchers or institutions) without requiring the sharing of raw private data. To make this vision practical, we introduce a suite of techniques tailored to the unique challenges of applying FedL in analog design--from generative model development and data heterogeneity handling to privacy-preserving strategies that ensure both flexibility and security for circuit designers and semiconductor manufacturers. Extensive experiments across varying client counts and dataset sizes demonstrate that AnalogFed achieves performance comparable to centralized baselines--while maintaining strict data privacy. Specifically, the generative AI model within AnalogFed achieves state-of-the-art efficiency and scalability in the design of analog circuit topologies.

CLJan 5
Fact-Checking with Large Language Models via Probabilistic Certainty and Consistency

Haoran Wang, Maryam Khalid, Qiong Wu et al.

Large language models (LLMs) are increasingly used in applications requiring factual accuracy, yet their outputs often contain hallucinated responses. While fact-checking can mitigate these errors, existing methods typically retrieve external evidence indiscriminately, overlooking the model's internal knowledge and potentially introducing irrelevant noise. Moreover, current systems lack targeted mechanisms to resolve specific uncertainties in the model's reasoning. Inspired by how humans fact-check, we argue that LLMs should adaptively decide whether to rely on internal knowledge or initiate retrieval based on their confidence in a given claim. We introduce Probabilistic Certainty and Consistency (PCC), a framework that estimates factual confidence by jointly modeling an LLM's probabilistic certainty and reasoning consistency. These confidence signals enable an adaptive verification strategy: the model answers directly when confident, triggers targeted retrieval when uncertain or inconsistent, and escalates to deep search when ambiguity is high. Our confidence-guided routing mechanism ensures that retrieval is invoked only when necessary, improving both efficiency and reliability. Extensive experiments across three challenging benchmarks show that PCC achieves better uncertainty quantification than verbalized confidence and consistently outperforms strong LLM-based fact-checking baselines. Furthermore, we demonstrate that PCC generalizes well across various LLMs.

CVNov 24, 2025
VideoPerceiver: Enhancing Fine-Grained Temporal Perception in Video Multimodal Large Language Models

Fufangchen Zhao, Liao Zhang, Daiqi Shi et al.

We propose VideoPerceiver, a novel video multimodal large language model (VMLLM) that enhances fine-grained perception in video understanding, addressing VMLLMs' limited ability to reason about brief actions in short clips or rare transient events in long videos. VideoPerceiver adopts a two-stage training framework. During supervised fine-tuning (SFT), we construct "key-information-missing" videos by extracting event-action keywords from captions, identifying corresponding key frames, and replacing them with adjacent frames. We jointly encode original and modified video tokens with text tokens, aligning intermediate visual representations with keywords via an auxiliary contrastive loss to enhance sensitivity to fine-grained motion cues. In reinforcement learning (RL), both video variants are fed into the model to generate descriptions, and a novel relative reward ensures responses from complete videos outperform those from degraded inputs, explicitly training the model to recover temporally precise action details. We also curate a dataset of 80,000 videos with fine-grained actions and transient events. Experiments show VideoPerceiver substantially outperforms state-of-the-art VMLLMs on fine-grained action understanding and rare event captioning benchmarks, while maintaining strong performance on standard tasks. By prioritizing task-relevant visual features, our work redefines video-language model training for fine-grained perception.

CVOct 9, 2025
ComGS: Efficient 3D Object-Scene Composition via Surface Octahedral Probes

Jian Gao, Mengqi Yuan, Yifei Zeng et al.

Gaussian Splatting (GS) enables immersive rendering, but realistic 3D object-scene composition remains challenging. Baked appearance and shadow information in GS radiance fields cause inconsistencies when combining objects and scenes. Addressing this requires relightable object reconstruction and scene lighting estimation. For relightable object reconstruction, existing Gaussian-based inverse rendering methods often rely on ray tracing, leading to low efficiency. We introduce Surface Octahedral Probes (SOPs), which store lighting and occlusion information and allow efficient 3D querying via interpolation, avoiding expensive ray tracing. SOPs provide at least a 2x speedup in reconstruction and enable real-time shadow computation in Gaussian scenes. For lighting estimation, existing Gaussian-based inverse rendering methods struggle to model intricate light transport and often fail in complex scenes, while learning-based methods predict lighting from a single image and are viewpoint-sensitive. We observe that 3D object-scene composition primarily concerns the object's appearance and nearby shadows. Thus, we simplify the challenging task of full scene lighting estimation by focusing on the environment lighting at the object's placement. Specifically, we capture a 360 degrees reconstructed radiance field of the scene at the location and fine-tune a diffusion model to complete the lighting. Building on these advances, we propose ComGS, a novel 3D object-scene composition framework. Our method achieves high-quality, real-time rendering at around 28 FPS, produces visually harmonious results with vivid shadows, and requires only 36 seconds for editing. Code and dataset are available at https://nju-3dv.github.io/projects/ComGS/.

CLSep 16, 2025
SitLLM: Large Language Models for Sitting Posture Health Understanding via Pressure Sensor Data

Jian Gao, Fufangchen Zhao, Yiyang Zhang et al.

Poor sitting posture is a critical yet often overlooked factor contributing to long-term musculoskeletal disorders and physiological dysfunctions. Existing sitting posture monitoring systems, although leveraging visual, IMU, or pressure-based modalities, often suffer from coarse-grained recognition and lack the semantic expressiveness necessary for personalized feedback. In this paper, we propose \textbf{SitLLM}, a lightweight multimodal framework that integrates flexible pressure sensing with large language models (LLMs) to enable fine-grained posture understanding and personalized health-oriented response generation. SitLLM comprises three key components: (1) a \textit{Gaussian-Robust Sensor Embedding Module} that partitions pressure maps into spatial patches and injects local noise perturbations for robust feature extraction; (2) a \textit{Prompt-Driven Cross-Modal Alignment Module} that reprograms sensor embeddings into the LLM's semantic space via multi-head cross-attention using the pre-trained vocabulary embeddings; and (3) a \textit{Multi-Context Prompt Module} that fuses feature-level, structure-level, statistical-level, and semantic-level contextual information to guide instruction comprehension.

LGJul 23, 2025
Self-similarity Analysis in Deep Neural Networks

Jingyi Ding, Chengwen Qi, Hongfei Wang et al.

Current research has found that some deep neural networks exhibit strong hierarchical self-similarity in feature representation or parameter distribution. However, aside from preliminary studies on how the power-law distribution of weights across different training stages affects model performance,there has been no quantitative analysis on how the self-similarity of hidden space geometry influences model weight optimization, nor is there a clear understanding of the dynamic behavior of internal neurons. Therefore, this paper proposes a complex network modeling method based on the output features of hidden-layer neurons to investigate the self-similarity of feature networks constructed at different hidden layers, and analyzes how adjusting the degree of self-similarity in feature networks can enhance the classification performance of deep neural networks. Validated on three types of networks MLP architectures, convolutional networks, and attention architectures this study reveals that the degree of self-similarity exhibited by feature networks varies across different model architectures. Furthermore, embedding constraints on the self-similarity of feature networks during the training process can improve the performance of self-similar deep neural networks (MLP architectures and attention architectures) by up to 6 percentage points.

LGJul 16, 2025
HyperEvent: A Strong Baseline for Dynamic Link Prediction via Relative Structural Encoding

Jian Gao, Jianshe Wu, JingYi Ding

Learning representations for continuous-time dynamic graphs is critical for dynamic link prediction. While recent methods have become increasingly complex, the field lacks a strong and informative baseline to reliably gauge progress. This paper proposes HyperEvent, a simple approach that captures relative structural patterns in event sequences through an intuitive encoding mechanism. As a straightforward baseline, HyperEvent leverages relative structural encoding to identify meaningful event sequences without complex parameterization. By combining these interpretable features with a lightweight transformer classifier, HyperEvent reframes link prediction as event structure recognition. Despite its simplicity, HyperEvent achieves competitive results across multiple benchmarks, often matching the performance of more complex models. This work demonstrates that effective modeling can be achieved through simple structural encoding, providing a clear reference point for evaluating future advancements.

CVMar 8, 2025
A Novel Trustworthy Video Summarization Algorithm Through a Mixture of LoRA Experts

Wenzhuo Du, Gerun Wang, Guancheng Chen et al.

With the exponential growth of user-generated content on video-sharing platforms, the challenge of facilitating efficient searching and browsing of videos has garnered significant attention. To enhance users' ability to swiftly locate and review pertinent videos, the creation of concise and informative video summaries has become increasingly important. Video-llama is an effective tool for generating video summarization, but it cannot effectively unify and optimize the modeling of temporal and spatial features and requires a lot of computational resources and time. Therefore, we propose MiLoRA-ViSum to more efficiently capture complex temporal dynamics and spatial relationships inherent in video data and to control the number of parameters for training. By extending traditional Low-Rank Adaptation (LoRA) into a sophisticated mixture-of-experts paradigm, MiLoRA-ViSum incorporates a dual temporal-spatial adaptation mechanism tailored specifically for video summarization tasks. This approach dynamically integrates specialized LoRA experts, each fine-tuned to address distinct temporal or spatial dimensions. Extensive evaluations of the VideoXum and ActivityNet datasets demonstrate that MiLoRA-ViSum achieves the best summarization performance compared to state-of-the-art models, while maintaining significantly lower computational costs. The proposed mixture-of-experts strategy, combined with the dual adaptation mechanism, highlights the model's potential to enhance video summarization capabilities, particularly in large-scale applications requiring both efficiency and precision.

CLJun 12, 2024
Making Task-Oriented Dialogue Datasets More Natural by Synthetically Generating Indirect User Requests

Amogh Mannekote, Jinseok Nam, Ziming Li et al.

Indirect User Requests (IURs), such as "It's cold in here" instead of "Could you please increase the temperature?" are common in human-human task-oriented dialogue and require world knowledge and pragmatic reasoning from the listener. While large language models (LLMs) can handle these requests effectively, smaller models deployed on virtual assistants often struggle due to resource constraints. Moreover, existing task-oriented dialogue benchmarks lack sufficient examples of complex discourse phenomena such as indirectness. To address this, we propose a set of linguistic criteria along with an LLM-based pipeline for generating realistic IURs to test natural language understanding (NLU) and dialogue state tracking (DST) models before deployment in a new domain. We also release IndirectRequests, a dataset of IURs based on the Schema Guided Dialog (SGD) corpus, as a comparative testbed for evaluating the performance of smaller models in handling indirect requests.

IVSep 8, 2020
Unsupervised Change Detection in Satellite Images with Generative Adversarial Network

Caijun Ren, Xiangyu Wang, Jian Gao et al.

Detecting changed regions in paired satellite images plays a key role in many remote sensing applications. The evolution of recent techniques could provide satellite images with very high spatial resolution (VHR) but made it challenging to apply image coregistration, and many change detection methods are dependent on its accuracy.Two images of the same scene taken at different time or from different angle would introduce unregistered objects and the existence of both unregistered areas and actual changed areas would lower the performance of many change detection algorithms in unsupervised condition.To alleviate the effect of unregistered objects in the paired images, we propose a novel change detection framework utilizing a special neural network architecture -- Generative Adversarial Network (GAN) to generate many better coregistered images. In this paper, we show that GAN model can be trained upon a pair of images through using the proposed expanding strategy to create a training set and optimizing designed objective functions. The optimized GAN model would produce better coregistered images where changes can be easily spotted and then the change map can be presented through a comparison strategy using these generated images explicitly.Compared to other deep learning-based methods, our method is less sensitive to the problem of unregistered images and makes most of the deep learning structure.Experimental results on synthetic images and real data with many different scenes could demonstrate the effectiveness of the proposed approach.

LGJul 30, 2019
An anomaly prediction framework for financial IT systems using hybrid machine learning methods

Jingwen Wang, Jingxin Liu, Juntao Pu et al.

In financial field, a robust software system is of vital importance to ensure the smooth operation of financial transactions. However, many financial corporations still depend on operators to identify and eliminate the system failures when financial software systems break down. This traditional operation method is time consuming and extremely inefficient. To improve the efficiency and accuracy of system failure detection and thereby reduce the impact of system failures on financial services, we propose a novel machine learning-based framework to predict the occurrence of system exceptions and failures in a financial software system. In particular, we first extract rich information from system logs and eliminate noises in the data. Then the cleaned data is leveraged as the input of our proposed anomaly prediction framework which consists of three modules: key performance indicator(KPI) data prediction module, anomaly identification module and severity classification module. Notably, we design a hierarchical architecture of alarm classifiers and try to alleviate the influence of class-imbalance problem on the overall performance. Empirically, the experimental results demonstrate the superior performance of our proposed method on a real-world financial software system log data set.

LGNov 3, 2018
Nonparallel Emotional Speech Conversion

Jian Gao, Deep Chakraborty, Hamidou Tembine et al.

We propose a nonparallel data-driven emotional speech conversion method. It enables the transfer of emotion-related characteristics of a speech signal while preserving the speaker's identity and linguistic content. Most existing approaches require parallel data and time alignment, which is not available in most real applications. We achieve nonparallel training based on an unsupervised style transfer technique, which learns a translation model between two distributions instead of a deterministic one-to-one mapping between paired examples. The conversion model consists of an encoder and a decoder for each emotion domain. We assume that the speech signal can be decomposed into an emotion-invariant content code and an emotion-related style code in latent space. Emotion conversion is performed by extracting and recombining the content code of the source speech and the style code of the target emotion. We tested our method on a nonparallel corpora with four emotions. Both subjective and objective evaluations show the effectiveness of our approach.

IRMar 22, 2018
A trust-based recommendation method using network diffusion processes

Ling-Jiao Chen, Jian Gao

A variety of rating-based recommendation methods have been extensively studied including the well-known collaborative filtering approaches and some network diffusion-based methods, however, social trust relations are not sufficiently considered when making recommendations. In this paper, we contribute to the literature by proposing a trust-based recommendation method, named CosRA+T, after integrating the information of trust relations into the resource-redistribution process. Specifically, a tunable parameter is used to scale the resources received by trusted users before the redistribution back to the objects. Interestingly, we find an optimal scaling parameter for the proposed CosRA+T method to achieve its best recommendation accuracy, and the optimal value seems to be universal under several evaluation metrics across different datasets. Moreover, results of extensive experiments on the two real-world rating datasets with trust relations, Epinions and FriendFeed, suggest that CosRA+T has a remarkable improvement in overall accuracy, diversity, and novelty. Our work takes a step towards designing better recommendation algorithms by employing multiple resources of social network information.

AIMay 31, 2017
Diversified Top-k Partial MaxSAT Solving

Junping Zhou, Huanyao Sun, Feifei Ma et al.

We introduce a diversified top-k partial MaxSAT problem, a combination of partial MaxSAT problem and enumeration problem. Given a partial MaxSAT formula F and a positive integer k, the diversified top-k partial MaxSAT is to find k maximal solutions for F such that the k maximal solutions satisfy the maximum number of soft clauses of F. This problem can be widely used in many applications including community detection, sensor place, motif discovery, and combinatorial testing. We prove the problem is NP-hard and propose an approach for solving the problem. The concrete idea of the approach is to design an encoding EE which reduces diversified top-k partial MaxSAT problem into partial MaxSAT problem, and then solve the resulting problem with state-of-art solvers. In addition, we present an algorithm MEMKC exactly solving the diversified top-k partial MaxSAT. Through several experiments we show that our approach can be successfully applied to the interesting problem.

IROct 8, 2015
A vertex similarity index for better personalized recommendation

Ling-Jiao Chen, Zi-Ke Zhang, Jin-Hu Liu et al.

Recommender systems benefit us in tackling the problem of information overload by predicting our potential choices among diverse niche objects. So far, a variety of personalized recommendation algorithms have been proposed and most of them are based on similarities, such as collaborative filtering and mass diffusion. Here, we propose a novel vertex similarity index named CosRA, which combines advantages of both the cosine index and the resource-allocation (RA) index. By applying the CosRA index to real recommender systems including MovieLens, Netflix and RYM, we show that the CosRA-based method has better performance in accuracy, diversity and novelty than some benchmark methods. Moreover, the CosRA index is free of parameters, which is a significant advantage in real applications. Further experiments show that the introduction of two turnable parameters cannot remarkably improve the overall performance of the CosRA index.

IRSep 2, 2015
Evaluating user reputation in online rating systems via an iterative group-based ranking method

Jian Gao, Tao Zhou

Reputation is a valuable asset in online social lives and it has drawn increased attention. How to evaluate user reputation in online rating systems is especially significant due to the existence of spamming attacks. To address this issue, so far, a variety of methods have been proposed, including network-based methods, quality-based methods and group-based ranking method. In this paper, we propose an iterative group-based ranking (IGR) method by introducing an iterative reputation-allocation process into the original group-based ranking (GR) method. More specifically, users with higher reputation have higher weights in dominating the corresponding group sizes. The reputation of users and the corresponding group sizes are iteratively updated until they become stable. Results on two real data sets suggest that the proposed IGR method has better performance and its robustness is considerably improved comparing with the original GR method. Our work highlights the positive role of users' grouping behavior towards a better reputation evaluation.

IRJan 4, 2015
Group-based ranking method for online rating systems with spamming attacks

Jian Gao, Yu-Wei Dong, Mingsheng Shang et al.

Ranking problem has attracted much attention in real systems. How to design a robust ranking method is especially significant for online rating systems under the threat of spamming attacks. By building reputation systems for users, many well-performed ranking methods have been applied to address this issue. In this Letter, we propose a group-based ranking method that evaluates users' reputations based on their grouping behaviors. More specifically, users are assigned with high reputation scores if they always fall into large rating groups. Results on three real data sets indicate that the present method is more accurate and robust than correlation-based method in the presence of spamming attacks.