Qi Yan

CV
h-index: 98
24 papers
428 citations
Novelty: 50%
AI Score: 59

24 Papers

LG · Jul 4, 2023 · Code
SwinGNN: Rethinking Permutation Invariance in Diffusion Models for Graph Generation

Qi Yan, Zhengyang Liang, Yang Song et al. · stanford

Diffusion models based on permutation-equivariant networks can learn permutation-invariant distributions for graph data. However, in comparison to their non-invariant counterparts, we have found that these invariant models encounter greater learning challenges since 1) their effective target distributions exhibit more modes; 2) their optimal one-step denoising scores are the score functions of Gaussian mixtures with more components. Motivated by this analysis, we propose a non-invariant diffusion model, called SwinGNN, which employs an efficient edge-to-edge 2-WL message passing network and utilizes shifted window based self-attention inspired by SwinTransformers. Further, through systematic ablations, we identify several critical training and sampling techniques that significantly improve the sample quality of graph generation. Finally, we introduce a simple post-processing trick, i.e., randomly permuting the generated graphs, which provably converts any graph generative model to a permutation-invariant one. Extensive experiments on synthetic and real-world protein and molecule datasets show that our SwinGNN achieves state-of-the-art performance. Our code is released at https://github.com/qiyan98/SwinGNN.
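
The post-processing trick mentioned in the abstract (randomly permuting a generated graph) can be sketched in a few lines. This is only an illustration, not the SwinGNN code: the function name `permute_graph` and the list-of-lists adjacency representation are assumptions made here.

```python
import random

def permute_graph(adj, rng=None):
    """Relabel the nodes of a graph given as a square adjacency matrix.

    Hypothetical helper for illustration: draws a uniformly random node
    ordering and returns the adjacency matrix under that ordering.
    """
    if rng is None:
        rng = random.Random()
    n = len(adj)
    perm = list(range(n))
    rng.shuffle(perm)
    # New node i corresponds to old node perm[i]; the edge set is unchanged.
    return [[adj[perm[i]][perm[j]] for j in range(n)] for i in range(n)]

# A 3-node path graph 0-1-2, as if it were the output of some generator.
path = [[0, 1, 0],
        [1, 0, 1],
        [0, 1, 0]]
shuffled = permute_graph(path, random.Random(0))
```

The relabeled graph is isomorphic to the input, so structural statistics such as the degree sequence are preserved while the node ordering becomes uniformly random.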

CV · Jul 23, 2024 · Code
Fréchet Video Motion Distance: A Metric for Evaluating Motion Consistency in Videos

Jiahe Liu, Youran Qu, Qi Yan et al.

Significant advancements have been made in video generative models recently. Unlike image generation, video generation presents greater challenges, requiring not only generating high-quality frames but also ensuring temporal consistency across these frames. Despite the impressive progress, research on metrics for evaluating the quality of generated videos, especially concerning temporal and motion consistency, remains underexplored. To bridge this research gap, we propose Fréchet Video Motion Distance (FVMD) metric, which focuses on evaluating motion consistency in video generation. Specifically, we design explicit motion features based on key point tracking, and then measure the similarity between these features via the Fréchet distance. We conduct sensitivity analysis by injecting noise into real videos to verify the effectiveness of FVMD. Further, we carry out a large-scale human study, demonstrating that our metric effectively detects temporal noise and aligns better with human perceptions of generated video quality than existing metrics. Additionally, our motion features can consistently improve the performance of Video Quality Assessment (VQA) models, indicating that our approach is also applicable to unary video quality evaluation. Code is available at https://github.com/ljh0v0/FMD-frechet-motion-distance.
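
As a rough illustration of the distance named in the abstract, the sketch below computes the squared Fréchet distance between Gaussian fits of two feature sets. It assumes diagonal covariances, which is a simplification made here so that no matrix square root is needed; the function name `frechet_distance_diag` is hypothetical, and the key-point-based motion features of FVMD are not reproduced.

```python
import math
from statistics import fmean, pvariance

def frechet_distance_diag(feats_a, feats_b):
    """Squared Frechet distance between two Gaussian fits, assuming
    diagonal covariances (an assumption of this sketch).

    Per dimension the contribution is (mu_a - mu_b)^2 + (sd_a - sd_b)^2.
    """
    total = 0.0
    for d in range(len(feats_a[0])):
        col_a = [f[d] for f in feats_a]
        col_b = [f[d] for f in feats_b]
        mu_a, mu_b = fmean(col_a), fmean(col_b)
        sd_a = math.sqrt(pvariance(col_a))
        sd_b = math.sqrt(pvariance(col_b))
        total += (mu_a - mu_b) ** 2 + (sd_a - sd_b) ** 2
    return total

# Two toy feature sets that differ only by a unit shift in the first dimension.
real = [[0.0, 0.0], [2.0, 0.0]]
fake = [[1.0, 0.0], [3.0, 0.0]]
distance = frechet_distance_diag(real, fake)
```

With full covariance matrices, the variance term becomes a trace involving the matrix square root of the covariance product, which needs a linear algebra library.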

LG · Oct 3, 2023 · Code
AutoCast++: Enhancing World Event Prediction with Zero-shot Ranking-based Context Retrieval

Qi Yan, Raihan Seraj, Jiawei He et al.

Machine-based prediction of real-world events is garnering attention due to its potential for informed decision-making. Whereas traditional forecasting predominantly hinges on structured data like time-series, recent breakthroughs in language models enable predictions using unstructured text. In particular, (Zou et al., 2022) unveils AutoCast, a new benchmark that employs news articles for answering forecasting queries. Nevertheless, existing methods still trail behind human performance. The cornerstone of accurate forecasting, we argue, lies in identifying a concise, yet rich subset of news snippets from a vast corpus. With this motivation, we introduce AutoCast++, a zero-shot ranking-based context retrieval system, tailored to sift through expansive news document collections for event forecasting. Our approach first re-ranks articles based on zero-shot question-passage relevance, honing in on semantically pertinent news. Following this, the chosen articles are subjected to zero-shot summarization to attain succinct context. Leveraging a pre-trained language model, we conduct both the relevance evaluation and article summarization without needing domain-specific training. Notably, recent articles can sometimes be at odds with preceding ones due to new facts or unanticipated incidents, leading to fluctuating temporal dynamics. To tackle this, our re-ranking mechanism gives preference to more recent articles, and we further regularize the multi-passage representation learning to align with human forecaster responses made on different dates. Empirical results underscore marked improvements across multiple metrics, improving the performance for multiple-choice questions (MCQ) by 48% and true/false (TF) questions by up to 8%. Code is available at https://github.com/BorealisAI/Autocast-plus-plus.

LG · Jun 21, 2023 · Code
What Constitutes Good Contrastive Learning in Time-Series Forecasting?

Chiyu Zhang, Qi Yan, Lili Meng et al.

In recent years, the introduction of self-supervised contrastive learning (SSCL) has demonstrated remarkable improvements in representation learning across various domains, including natural language processing and computer vision. By leveraging the inherent benefits of self-supervision, SSCL enables the pre-training of representation models using vast amounts of unlabeled data. Despite these advances, there remains a significant gap in understanding the impact of different SSCL strategies on time series forecasting performance, as well as the specific benefits that SSCL can bring. This paper aims to address these gaps by conducting a comprehensive analysis of the effectiveness of various training variables, including different SSCL algorithms, learning strategies, model architectures, and their interplay. Additionally, to gain deeper insights into the improvements brought about by SSCL in the context of time-series forecasting, a qualitative analysis of the empirical receptive field is performed. Through our experiments, we demonstrate that the end-to-end training of a Transformer model using the Mean Squared Error (MSE) loss and SSCL emerges as the most effective approach in time series forecasting. Notably, the incorporation of the contrastive objective enables the model to prioritize more pertinent information for forecasting, such as scale and periodic relationships. These findings contribute to a better understanding of the benefits of SSCL in time series forecasting and provide valuable insights for future research in this area. Our codes are available at https://github.com/chiyuzhang94/contrastive_learning_time-series_e2e.

AI · May 31, 2022
Hierarchically Constrained Adaptive Ad Exposure in Feeds

Dagui Chen, Qi Yan, Chunjie Chen et al.

A contemporary feed application usually provides blended results of organic items and sponsored items (ads) to users. Conventionally, ads are exposed at fixed positions. Such a static exposure strategy is inefficient because it ignores users' personalized preferences towards ads. To this end, adaptive ad exposure has become an appealing strategy to boost the overall performance of the feed. However, existing approaches to implementing adaptive ad exposure still suffer from several limitations: 1) they usually fall into sub-optimal solutions because they focus only on request-level optimization without considering the long-term application-level performance and constraints, 2) they neglect the necessity of keeping the game-theoretical properties of ad auctions, which may lead to anarchy in bidding, and 3) they can hardly be deployed in large-scale applications due to high computational complexity. In this paper, we focus on long-term performance optimization under hierarchical constraints in feeds and formulate adaptive ad exposure as a Dynamic Knapsack Problem. We propose an effective approach: Hierarchically Constrained Adaptive Ad Exposure (HCA2E). We show that HCA2E possesses desired game-theoretical properties, computational efficiency, and performance robustness. Comprehensive offline and online experiments on a leading e-commerce application demonstrate the significant performance superiority of HCA2E over representative baselines. HCA2E has also been deployed on this application to serve millions of daily users.

62.9 · CV · May 6
The First Controllable Bokeh Rendering Challenge at NTIRE 2026

Tim Seizinger, Florin-Alexandru Vasluianu, Jeffrey Chen et al.

This study presents the outcomes of the first Controllable Bokeh Rendering Challenge at NTIRE and highlights the most effective submitted methodologies. In total, 44 participants registered for the competition, of which 8 teams submitted valid solutions after the conclusion of the final test phase. All submissions were evaluated on unseen images, focusing on portraits and intricate subjects with complex and visually appealing bokeh phenomena. In addition to the first track focusing on established quantitative fidelity metrics, we conducted a qualitative user study with a panel of experts for a second track focusing on perceptual assessment. As this was the inaugural challenge on this topic, most of the participants focused on refining and extending the Bokehlicious baseline method.

57.3 · CV · Apr 14
OmniHands: Towards Robust 4D Hand Mesh Recovery via A Versatile Transformer

Dixuan Lin, Yuxiang Zhang, Mengcheng Li et al.

In this paper, we introduce OmniHands, a universal approach to recovering interactive hand meshes and their relative movement from monocular or multi-view inputs. Our approach addresses two major limitations of previous methods: lacking a unified solution for handling various hand image inputs and neglecting the positional relationship of two hands within images. To overcome these challenges, we develop a universal architecture with novel tokenization and contextual feature fusion strategies, capable of adapting to a variety of tasks. Specifically, we propose a Relation-aware Two-Hand Tokenization (RAT) method to embed positional relation information into the hand tokens. In this way, our network can handle both single-hand and two-hand inputs and explicitly leverage relative hand positions, facilitating the reconstruction of intricate hand interactions in real-world scenarios. As such tokenization indicates the relative relationship of two hands, it also supports more effective feature fusion. To this end, we further develop a 4D Interaction Reasoning (FIR) module to fuse hand tokens in 4D with attention and decode them into 3D hand meshes and relative temporal movements. The efficacy of our approach is validated on several benchmark datasets. The results on in-the-wild videos and real-world scenarios demonstrate the superior performances of our approach for interactive hand reconstruction. More video results can be found on the project page: https://OmniHand.github.io.

CV · Jun 10, 2025 · Code
StreamSplat: Towards Online Dynamic 3D Reconstruction from Uncalibrated Video Streams

Zike Wu, Qi Yan, Xuanyu Yi et al.

Real-time reconstruction of dynamic 3D scenes from uncalibrated video streams is crucial for numerous real-world applications. However, existing methods struggle to jointly address three key challenges: 1) processing uncalibrated inputs in real time, 2) accurately modeling dynamic scene evolution, and 3) maintaining long-term stability and computational efficiency. To this end, we introduce StreamSplat, the first fully feed-forward framework that transforms uncalibrated video streams of arbitrary length into dynamic 3D Gaussian Splatting (3DGS) representations in an online manner, capable of recovering scene dynamics from temporally local observations. We propose two key technical innovations: a probabilistic sampling mechanism in the static encoder for 3DGS position prediction, and a bidirectional deformation field in the dynamic decoder that enables robust and efficient dynamic modeling. Extensive experiments on static and dynamic benchmarks demonstrate that StreamSplat consistently outperforms prior works in both reconstruction quality and dynamic scene modeling, while uniquely supporting online reconstruction of arbitrarily long video streams. Code and models are available at https://github.com/nickwzk/StreamSplat.

CV · May 6, 2024 · Code
Video Diffusion Models: A Survey

Andrew Melnik, Michal Ljubljanac, Cong Lu et al.

Diffusion generative models have recently become a powerful technique for creating and modifying high-quality, coherent video content. This survey provides a comprehensive overview of the critical components of diffusion models for video generation, including their applications, architectural design, and temporal dynamics modeling. The paper begins by discussing the core principles and mathematical formulations, then explores various architectural choices and methods for maintaining temporal consistency. A taxonomy of applications is presented, categorizing models based on input modalities such as text prompts, images, videos, and audio signals. Advancements in text-to-video generation are discussed to illustrate the state-of-the-art capabilities and limitations of current approaches. Additionally, the survey summarizes recent developments in training and evaluation practices, including the use of diverse video and image datasets and the adoption of various evaluation metrics to assess model performance. The survey concludes with an examination of ongoing challenges, such as generating longer videos and managing computational costs, and offers insights into potential future directions for the field. By consolidating the latest research and developments, this survey aims to serve as a valuable resource for researchers and practitioners working with video diffusion models. Website: https://github.com/ndrwmlnk/Awesome-Video-Diffusion-Models

LG · Dec 21, 2020 · Code
Social NCE: Contrastive Learning of Socially-aware Motion Representations

Yuejiang Liu, Qi Yan, Alexandre Alahi

Learning socially-aware motion representations is at the core of recent advances in multi-agent problems, such as human motion forecasting and robot navigation in crowds. Despite promising progress, existing representations learned with neural networks still struggle to generalize in closed-loop predictions (e.g., output colliding trajectories). This issue largely arises from the non-i.i.d. nature of sequential prediction in conjunction with ill-distributed training data. Intuitively, if the training data only comes from human behaviors in safe spaces, i.e., from "positive" examples, it is difficult for learning algorithms to capture the notion of "negative" examples like collisions. In this work, we aim to address this issue by explicitly modeling negative examples through self-supervision: (i) we introduce a social contrastive loss that regularizes the extracted motion representation by discerning the ground-truth positive events from synthetic negative ones; (ii) we construct informative negative samples based on our prior knowledge of rare but dangerous circumstances. Our method substantially reduces the collision rates of recent trajectory forecasting, behavioral cloning and reinforcement learning algorithms, outperforming state-of-the-art methods on several benchmarks. Our code is available at https://github.com/vita-epfl/social-nce.

CV · Apr 25, 2024
NTIRE 2024 Quality Assessment of AI-Generated Content Challenge

Xiaohong Liu, Xiongkuo Min, Guangtao Zhai et al.

This paper reports on the NTIRE 2024 Quality Assessment of AI-Generated Content Challenge, which will be held in conjunction with the New Trends in Image Restoration and Enhancement Workshop (NTIRE) at CVPR 2024. This challenge is to address a major challenge in the field of image and video processing, namely, Image Quality Assessment (IQA) and Video Quality Assessment (VQA) for AI-Generated Content (AIGC). The challenge is divided into the image track and the video track. The image track uses the AIGIQA-20K, which contains 20,000 AI-Generated Images (AIGIs) generated by 15 popular generative models. The image track has a total of 318 registered participants. A total of 1,646 submissions are received in the development phase, and 221 submissions are received in the test phase. Finally, 16 participating teams submitted their models and fact sheets. The video track uses the T2VQA-DB, which contains 10,000 AI-Generated Videos (AIGVs) generated by 9 popular Text-to-Video (T2V) models. A total of 196 participants have registered in the video track. A total of 991 submissions are received in the development phase, and 185 submissions are received in the test phase. Finally, 12 participating teams submitted their models and fact sheets. Some methods have achieved better results than baseline methods, and the winning methods in both tracks have demonstrated superior prediction performance on AIGC.

CV · Mar 13, 2025
MoFlow: One-Step Flow Matching for Human Trajectory Forecasting via Implicit Maximum Likelihood Estimation based Distillation

Yuxiang Fu, Qi Yan, Lele Wang et al.

In this paper, we address the problem of human trajectory forecasting, which aims to predict the inherently multi-modal future movements of humans based on their past trajectories and other contextual cues. We propose a novel motion prediction conditional flow matching model, termed MoFlow, to predict K-shot future trajectories for all agents in a given scene. We design a novel flow matching loss function that not only ensures at least one of the K sets of future trajectories is accurate but also encourages all K sets of future trajectories to be diverse and plausible. Furthermore, by leveraging the implicit maximum likelihood estimation (IMLE), we propose a novel distillation method for flow models that only requires samples from the teacher model. Extensive experiments on the real-world datasets, including SportVU NBA games, ETH-UCY, and SDD, demonstrate that both our teacher flow model and the IMLE-distilled student model achieve state-of-the-art performance. These models can generate diverse trajectories that are physically and socially plausible. Moreover, our one-step student model is 100 times faster than the teacher flow model during sampling. The code, model, and data are available at our project page: https://moflow-imle.github.io

CV · Jan 2, 2024
Joint Generative Modeling of Grounded Scene Graphs and Images via Diffusion Models

Bicheng Xu, Qi Yan, Renjie Liao et al.

We introduce a framework for joint grounded scene graph - image generation, a challenging task involving high-dimensional, multi-modal structured data. To effectively model this complex joint distribution, we adopt a factorized approach: first generating a grounded scene graph, followed by image generation conditioned on the generated grounded scene graph. While conditional image generation has been widely explored in the literature, our primary focus is on the generation of grounded scene graphs from noise, which provides efficient and interpretable control over the image generation process. This task requires generating plausible grounded scene graphs with heterogeneous attributes for both nodes (objects) and edges (relations among objects), encompassing continuous attributes (e.g., object bounding boxes) and discrete attributes (e.g., object and relation categories). To address this challenge, we introduce DiffuseSG, a novel diffusion model that jointly models the heterogeneous node and edge attributes. We explore different encoding strategies to effectively handle the categorical data. Leveraging a graph transformer as the denoiser, DiffuseSG progressively refines grounded scene graph representations in a continuous space before discretizing them to generate structured outputs. Additionally, we introduce an IoU-based regularization term to enhance empirical performance. Our model outperforms existing methods in grounded scene graph generation on the VG and COCO-Stuff datasets, excelling in both standard and newly introduced metrics that more accurately capture the task's complexity. Furthermore, we demonstrate the broader applicability of DiffuseSG in two important downstream tasks: 1) achieving superior results in a range of grounded scene graph completion tasks, and 2) enhancing grounded scene graph detection models by leveraging additional training samples generated by DiffuseSG.

CV · Jun 10, 2025
TrajFlow: Multi-modal Motion Prediction via Flow Matching

Qi Yan, Brian Zhang, Yutong Zhang et al.

Efficient and accurate motion prediction is crucial for ensuring safety and informed decision-making in autonomous driving, particularly under dynamic real-world conditions that necessitate multi-modal forecasts. We introduce TrajFlow, a novel flow matching-based motion prediction framework that addresses the scalability and efficiency challenges of existing generative trajectory prediction methods. Unlike conventional generative approaches that employ i.i.d. sampling and require multiple inference passes to capture diverse outcomes, TrajFlow predicts multiple plausible future trajectories in a single pass, significantly reducing computational overhead while maintaining coherence across predictions. Moreover, we propose a ranking loss based on the Plackett-Luce distribution to improve uncertainty estimation of predicted trajectories. Additionally, we design a self-conditioning training technique that reuses the model's own predictions to construct noisy inputs during a second forward pass, thereby improving generalization and accelerating inference. Extensive experiments on the large-scale Waymo Open Motion Dataset (WOMD) demonstrate that TrajFlow achieves state-of-the-art performance across various key metrics, underscoring its effectiveness for safety-critical autonomous driving applications. The code and other details are available on the project website https://traj-flow.github.io/.
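
For readers unfamiliar with the Plackett-Luce distribution mentioned in the abstract, the following is a minimal sketch of its negative log-likelihood for a given ranking. How TrajFlow builds its ranking targets and scores is not stated here, so the function `plackett_luce_nll` and its inputs are illustrative assumptions only.

```python
import math

def plackett_luce_nll(scores, ranking):
    """Negative log-likelihood of `ranking` (item indices, best first)
    under a Plackett-Luce model with real-valued `scores`.

    At each position, the chosen item competes via a softmax against all
    items that have not been ranked yet.
    """
    nll = 0.0
    for k in range(len(ranking)):
        remaining = [scores[i] for i in ranking[k:]]
        m = max(remaining)  # subtract the max for a stable log-sum-exp
        log_z = m + math.log(sum(math.exp(s - m) for s in remaining))
        nll -= scores[ranking[k]] - log_z
    return nll

# Scores that agree with the ranking give a lower loss than scores that oppose it.
agree = plackett_luce_nll([2.0, 1.0, 0.0], [0, 1, 2])
oppose = plackett_luce_nll([2.0, 1.0, 0.0], [2, 1, 0])
uniform = plackett_luce_nll([0.0, 0.0, 0.0], [0, 1, 2])
```

With equal scores every ordering of three items is equally likely, so the loss equals log(3!) = log 6.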

LG · Jun 4, 2025
RETRO SYNFLOW: Discrete Flow Matching for Accurate and Diverse Single-Step Retrosynthesis

Robin Yadav, Qi Yan, Guy Wolf et al.

A fundamental problem in organic chemistry is identifying and predicting the series of reactions that synthesize a desired target product molecule. Due to the combinatorial nature of the chemical search space, single-step reactant prediction -- i.e., single-step retrosynthesis -- remains challenging even for existing state-of-the-art template-free generative approaches to produce an accurate yet diverse set of feasible reactions. In this paper, we model single-step retrosynthesis planning and introduce RETRO SYNFLOW (RSF), a discrete flow-matching framework that builds a Markov bridge between the prescribed target product molecule and the reactant molecule. In contrast to past approaches, RSF employs a reaction center identification step to produce intermediate structures known as synthons as a more informative source distribution for the discrete flow. To further enhance diversity and feasibility of generated samples, we employ Feynman-Kac steering with Sequential Monte Carlo based resampling to steer promising generations at inference using a new reward oracle that relies on a forward-synthesis model. Empirically, we demonstrate RSF achieves 60.0% top-1 accuracy, which outperforms the previous SOTA by 20%. We also substantiate the benefits of steering at inference and demonstrate that FK-steering improves top-5 round-trip accuracy by 19% over prior template-free SOTA methods, all while preserving competitive top-k accuracy results.

SP · Sep 16, 2025
A Measurement Report Data-Driven Framework for Localized Statistical Channel Modeling

Xinyu Qin, Ye Xue, Qi Yan et al.

Localized statistical channel modeling (LSCM) is crucial for effective performance evaluation in digital twin-assisted network optimization. Solely relying on the multi-beam reference signal receiving power (RSRP), LSCM aims to model the localized statistical propagation environment by estimating the channel angular power spectrum (APS). However, existing methods rely heavily on drive test data with high collection costs and limited spatial coverage. In this paper, we propose a measurement report (MR) data-driven framework for LSCM, exploiting the low-cost and extensive collection of MR data. The framework comprises two novel modules. The MR localization module addresses the issue of missing locations in MR data by introducing a semi-supervised method based on hypergraph neural networks, which exploits multi-modal information via distance-aware hypergraph modeling and hypergraph convolution for location extraction. To enhance the computational efficiency and solution robustness, LSCM operates at the grid level. Compared to independently constructing geographically uniform grids and estimating channel APS, the joint grid construction and channel APS estimation module enhances robustness in complex environments with spatially non-uniform data by exploiting their correlation. This module alternately optimizes grid partitioning and APS estimation using clustering and improved sparse recovery for the ill-conditioned measurement matrix and incomplete observations. Through comprehensive experiments on a real-world MR dataset, we demonstrate the superior performance and robustness of our framework in localization and channel modeling.

LG · Aug 11, 2025
Score Augmentation for Diffusion Models

Liang Hou, Yuan Gao, Boyuan Jiang et al.

Diffusion models have achieved remarkable success in generative modeling. However, this study confirms the existence of overfitting in diffusion model training, particularly in data-limited regimes. To address this challenge, we propose Score Augmentation (ScoreAug), a novel data augmentation framework specifically designed for diffusion models. Unlike conventional augmentation approaches that operate on clean data, ScoreAug applies transformations to noisy data, aligning with the inherent denoising mechanism of diffusion. Crucially, ScoreAug further requires the denoiser to predict the augmentation of the original target. This design establishes an equivariant learning objective, enabling the denoiser to learn scores across varied denoising spaces, thereby realizing what we term score augmentation. We also theoretically analyze the relationship between scores in different spaces under general transformations. In experiments, we extensively validate ScoreAug on multiple benchmarks including CIFAR-10, FFHQ, AFHQv2, and ImageNet, with results demonstrating significant performance improvements over baselines. Notably, ScoreAug effectively mitigates overfitting across diverse scenarios, such as varying data scales and model capacities, while exhibiting stable convergence properties. Another advantage of ScoreAug over standard data augmentation lies in its ability to circumvent data leakage issues under certain conditions. Furthermore, we show that ScoreAug can be synergistically combined with traditional data augmentation techniques to achieve additional performance gains.

LG · Jul 21, 2025
Learning to Gridize: Segment Physical World by Wireless Communication Channel

Juntao Wang, Feng Yin, Tian Ding et al.

Gridization, the process of partitioning space into grids where users share similar channel characteristics, serves as a fundamental prerequisite for efficient large-scale network optimization. However, existing methods like Geographical or Beam Space Gridization (GSG or BSG) are limited by reliance on unavailable location data or the flawed assumption that similar signal strengths imply similar channel properties. We propose Channel Space Gridization (CSG), a pioneering framework that unifies channel estimation and gridization for the first time. Formulated as a joint optimization problem, CSG uses only beam-level reference signal received power (RSRP) to estimate Channel Angle Power Spectra (CAPS) and partition samples into grids with homogeneous channel characteristics. To perform CSG, we develop the CSG Autoencoder (CSG-AE), featuring a trainable RSRP-to-CAPS encoder, a learnable sparse codebook quantizer, and a physics-informed decoder based on the Localized Statistical Channel Model. Recognizing the limitations of a naive training scheme, we propose a novel Pretraining-Initialization-Detached-Asynchronous (PIDA) training scheme for CSG-AE, ensuring stable and effective training by systematically addressing the common pitfalls of the naive training paradigm. Evaluations reveal that CSG-AE excels in CAPS estimation accuracy and clustering quality on synthetic data. On real-world datasets, it reduces Active Mean Absolute Error (MAE) by 30% and Overall MAE by 65% in RSRP prediction compared to salient baselines using the same data, while improving channel consistency, cluster size balance, and active ratio, advancing the development of gridization for large-scale network optimization.

LG · Jun 5, 2025
Neural MJD: Neural Non-Stationary Merton Jump Diffusion for Time Series Prediction

Yuanpei Gao, Qi Yan, Yan Leng et al.

While deep learning methods have achieved strong performance in time series prediction, their black-box nature and inability to explicitly model underlying stochastic processes often limit their generalization to non-stationary data, especially in the presence of abrupt changes. In this work, we introduce Neural MJD, a neural network based non-stationary Merton jump diffusion (MJD) model. Our model explicitly formulates forecasting as a stochastic differential equation (SDE) simulation problem, combining a time-inhomogeneous Itô diffusion to capture non-stationary stochastic dynamics with a time-inhomogeneous compound Poisson process to model abrupt jumps. To enable tractable learning, we introduce a likelihood truncation mechanism that caps the number of jumps within small time intervals and provide a theoretical error bound for this approximation. Additionally, we propose an Euler-Maruyama with restart solver, which achieves a provably lower error bound in estimating expected states and reduced variance compared to the standard solver. Experiments on both synthetic and real-world datasets demonstrate that Neural MJD consistently outperforms state-of-the-art deep learning and statistical learning methods.
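
To make the jump-diffusion setting concrete, here is a minimal Euler-Maruyama simulation of a jump diffusion with constant coefficients. It is a textbook-style sketch, not the Neural MJD model: the coefficients are fixed rather than time-inhomogeneous and network-predicted, the restart solver is not included, and allowing at most one jump per step (a Bernoulli draw with probability `lam * dt`) is a simplification chosen here. All names are hypothetical.

```python
import math
import random

def simulate_jump_diffusion(x0, drift, sigma, lam, jump_mean, jump_std,
                            horizon, steps, rng):
    """Euler-Maruyama path of dX = drift dt + sigma dW + Y dN.

    N is approximated by at most one jump per step, which is reasonable
    when lam * dt is small; jump sizes Y are Gaussian.
    """
    dt = horizon / steps
    x = x0
    path = [x]
    for _ in range(steps):
        # Diffusion part: deterministic drift plus a Brownian increment.
        x += drift * dt + sigma * math.sqrt(dt) * rng.gauss(0.0, 1.0)
        # Jump part: a jump occurs in this step with probability lam * dt.
        if rng.random() < lam * dt:
            x += rng.gauss(jump_mean, jump_std)
        path.append(x)
    return path

rng = random.Random(0)
noisy_path = simulate_jump_diffusion(0.0, 0.05, 0.2, 1.0, -0.1, 0.05, 1.0, 250, rng)
# With no noise and no jumps the scheme reduces to a straight line.
flat_path = simulate_jump_diffusion(0.0, 0.5, 0.0, 0.0, 0.0, 0.0, 1.0, 100, rng)
```

Averaging many such paths gives a Monte Carlo estimate of the expected future state, which is the kind of quantity the abstract's solver is said to target.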

CV · May 14, 2024
Harnessing the power of longitudinal medical imaging for eye disease prognosis using Transformer-based sequence modeling

Gregory Holste, Mingquan Lin, Ruiwen Zhou et al.

Deep learning has enabled breakthroughs in automated diagnosis from medical imaging, with many successful applications in ophthalmology. However, standard medical image classification approaches only assess disease presence at the time of acquisition, neglecting the common clinical setting of longitudinal imaging. For slow, progressive eye diseases like age-related macular degeneration (AMD) and primary open-angle glaucoma (POAG), patients undergo repeated imaging over time to track disease progression, and forecasting the future risk of developing disease is critical to properly plan treatment. Our proposed Longitudinal Transformer for Survival Analysis (LTSA) enables dynamic disease prognosis from longitudinal medical imaging, modeling the time to disease from sequences of fundus photography images captured over long, irregular time periods. Using longitudinal imaging data from the Age-Related Eye Disease Study (AREDS) and Ocular Hypertension Treatment Study (OHTS), LTSA significantly outperformed a single-image baseline in 19/20 head-to-head comparisons on late AMD prognosis and 18/20 comparisons on POAG prognosis. A temporal attention analysis also suggested that, while the most recent image is typically the most influential, prior imaging still provides additional prognostic value.

CV · Dec 16, 2021
CrossLoc: Scalable Aerial Localization Assisted by Multimodal Synthetic Data

Qi Yan, Jianhao Zheng, Simon Reding et al.

We present a visual localization system that learns to estimate camera poses in the real world with the help of synthetic data. Despite significant progress in recent years, most learning-based approaches to visual localization target a single domain and require a dense database of geo-tagged images to function well. To mitigate the data scarcity issue and improve the scalability of the neural localization models, we introduce TOPO-DataGen, a versatile synthetic data generation tool that traverses smoothly between the real and virtual world, hinged on the geographic camera viewpoint. New large-scale sim-to-real benchmark datasets are proposed to showcase and evaluate the utility of the said synthetic data. Our experiments reveal that synthetic data generically enhances the neural network performance on real data. Furthermore, we introduce CrossLoc, a cross-modal visual representation learning approach to pose estimation that makes full use of the scene coordinate ground truth via self-supervision. Without any extra data, CrossLoc significantly outperforms the state-of-the-art methods and achieves substantially higher real-data sample efficiency. Our code and datasets are all available at https://crossloc.github.io/.

RODec 10, 2019
Measurement Scheduling for Cooperative Localization in Resource-Constrained Conditions

Qi Yan, Li Jiang, Solmaz Kia

This paper studies the measurement scheduling problem for a group of N mobile robots moving on a flat surface that are performing cooperative localization (CL). We consider a scenario in which, due to limited on-board resources such as battery life and communication bandwidth, only a given number of relative measurements per robot is allowed at the observation and update stage. Optimal selection of which teammates a robot should take a relative measurement from, such that the updated joint localization uncertainty of the team is minimized, is an NP-hard problem. In this paper, we propose a suboptimal greedy approach that allows each robot to choose its landmark robots locally in polynomial time. Our method, unlike the known results in the literature, does not assume full observability of the CL algorithm. Moreover, it does not require inter-robot communication at the scheduling stage. That is, there is no need for the robots to collaborate to carry out the landmark robot selections. We discuss the application of our method in the context of a state-of-the-art decentralized CL algorithm and demonstrate its effectiveness through numerical simulations. Even though our solution does not come with rigorous performance guarantees, its low computational cost along with no communication requirement makes it an appealing solution for operations with resource-constrained robots.
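
As a rough illustration of what a local, polynomial-time greedy selection under a measurement budget can look like, here is a minimal sketch. It is not the paper's algorithm: it assumes a simplified model in which each candidate landmark robot directly observes the selecting robot's 2D position with a known noise covariance, and it scores candidates by the trace of the resulting position covariance. The function names, the information-form update, and the trace criterion are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple

# 2x2 matrix stored as nested tuples (hypothetical helper type).
Mat2 = Tuple[Tuple[float, float], Tuple[float, float]]


def inv2(m: Mat2) -> Mat2:
    """Invert a 2x2 matrix."""
    (a, b), (c, d) = m
    det = a * d - b * c
    if det == 0.0:
        raise ValueError("singular matrix")
    return ((d / det, -b / det), (-c / det, a / det))


def add2(x: Mat2, y: Mat2) -> Mat2:
    """Element-wise sum of two 2x2 matrices."""
    return tuple(tuple(x[i][j] + y[i][j] for j in range(2)) for i in range(2))


def trace_of_inverse(info: Mat2) -> float:
    """Trace of the covariance that corresponds to an information matrix."""
    cov = inv2(info)
    return cov[0][0] + cov[1][1]


def greedy_landmarks(
    prior_cov: Mat2, candidate_noise: Dict[str, Mat2], budget: int
) -> List[str]:
    """Pick up to `budget` landmarks, each time taking the candidate whose
    measurement most reduces the trace of the robot's own covariance.

    Assumed model: a measurement from a candidate adds the inverse of its
    noise covariance to the robot's information matrix.
    """
    info = inv2(prior_cov)
    remaining = dict(candidate_noise)
    chosen: List[str] = []
    for _ in range(min(budget, len(remaining))):
        best = min(
            remaining,
            key=lambda name: trace_of_inverse(add2(info, inv2(remaining[name]))),
        )
        info = add2(info, inv2(remaining.pop(best)))
        chosen.append(best)
    return chosen


if __name__ == "__main__":
    prior = ((4.0, 0.0), (0.0, 4.0))
    candidates = {
        "a": ((1.0, 0.0), (0.0, 1.0)),
        "b": ((0.25, 0.0), (0.0, 100.0)),
        "c": ((9.0, 0.0), (0.0, 9.0)),
    }
    print(greedy_landmarks(prior, candidates, budget=2))
```

The selection uses only quantities the robot holds locally and evaluates each remaining candidate once per pick, so the cost grows with the budget times the number of candidates. Like the approach described in the abstract, this sketch offers no optimality guarantee.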

NCNov 6, 2018
Revealing Fine Structures of the Retinal Receptive Field by Deep Learning Networks

Qi Yan, Yajing Zheng, Shanshan Jia et al.

Deep convolutional neural networks (CNNs) have demonstrated impressive performance on many visual tasks. Recently, they have become useful models for the visual system in neuroscience. However, it is still not clear what CNNs learn in terms of neuronal circuits. When a deep CNN with many layers is used for the visual system, it is not easy to compare the structure components of CNNs with possible neuroscience underpinnings due to highly complex circuits from the retina to higher visual cortex. Here we address this issue by focusing on single retinal ganglion cells with biophysical models and recording data from animals. By training CNNs with white noise images to predict neuronal responses, we found that fine structures of the retinal receptive field can be revealed. Specifically, the learned convolutional filters resemble biological components of the retinal circuit. This suggests that a CNN learning from one single retinal cell reveals a minimal neural network carried out in this cell. Furthermore, when CNNs learned from different cells are transferred between cells, there is a diversity of transfer learning performance, which indicates that CNNs are cell-specific. Moreover, when CNNs are transferred between different types of input images, here white noise vs. natural images, transfer learning shows good performance, which implies that CNNs indeed capture the full computational ability of a single retinal cell for different inputs. Taken together, these results suggest that CNNs could be used to reveal structure components of neuronal circuits, and provide a powerful model for neural system identification.

MLNov 8, 2017
Revealing structure components of the retina by deep learning networks

Qi Yan, Zhaofei Yu, Feng Chen et al.

Deep convolutional neural networks (CNNs) have demonstrated impressive performance on visual object classification tasks. In addition, they are useful models for predicting neuronal responses recorded in the visual system. However, there is still no clear understanding of what CNNs learn in terms of visual neuronal circuits. Visualizing CNN features to obtain possible connections to neuroscience underpinnings is not easy due to highly complex circuits from the retina to higher visual cortex. Here we address this issue by focusing on single retinal ganglion cells with a simple model and electrophysiological recordings from salamanders. By training CNNs with white noise images to predict neural responses, we found that the convolutional filters learned in the end resemble biological components of the retinal circuit. Features represented by these filters tile the space of the conventional receptive field of retinal ganglion cells. These results suggest that CNNs could be used to reveal structure components of neuronal circuits.