ITMay 23
Two-Stage Coded-Sliding Beam Training and QoS-Constrained Sum-Rate Maximization for SIM-Assisted Wireless CommunicationsQian Zhang, Ju Liu, Yao Ge et al.
Stacked intelligent metasurfaces (SIM) provide a cost-effective and scalable solution for large-scale antenna communications.However, efficient channel state information acquisition and phase shift optimization remain critical challenges. In this paper, we develop a unified framework of low-complexity algorithms for SIM-assisted communication systems to address these issues. Specifically, we propose a generalized two-step codebook construction (TSCC) method that leverages two-dimensional angular-domain decoupling to transform planar array beamformer design into two independent one-dimensional linear array beamformer design problems, efficiently solved via the Gerchberg-Saxton algorithm and our proposed majorization-minimization-based proximal distance (PDMM) algorithm. We further develop a two-stage coded-sliding beam training (TSCSBT) method for low-overhead and high-accuracy beam training, where error-correcting codes are embedded in the first-stage training to enhance robustness against noise, and sliding sampling is subsequently performed around the matched angular samples to improve angular resolution. The proposed framework is further extended to multi-path user channels. Finally, a variable decoupling-based block successive upper bound minimization (VD-BSUM) algorithm is proposed to directly solve the QoS-constrained sum-rate maximization problem through closed-form iterative updates with substantially reduced computational complexity. Simulation results demonstrate the effectiveness of the proposed methods in achieving precise beam pattern realization, improved beam training accuracy and angular resolution, and enhanced sum-rate performance.
CVDec 1, 2025Code
Spatiotemporal Pyramid Flow Matching for Climate EmulationJeremy Andrew Irvin, Jiaqi Han, Zikui Wang et al.
Generative models have the potential to transform the way we emulate Earth's changing climate. Previous generative approaches rely on weather-scale autoregression for climate emulation, but this is inherently slow for long climate horizons and has yet to demonstrate stable rollouts under nonstationary forcings. Here, we introduce Spatiotemporal Pyramid Flows (SPF), a new class of flow matching approaches that model data hierarchically across spatial and temporal scales. Inspired by cascaded video models, SPF partitions the generative trajectory into a spatiotemporal pyramid, progressively increasing spatial resolution to reduce computation and coupling each stage with an associated timescale to enable direct sampling at any temporal level in the pyramid. This design, together with conditioning each stage on prescribed physical forcings (e.g., greenhouse gases or aerosols), enables efficient, parallel climate emulation at multiple timescales. On ClimateBench, SPF outperforms strong flow matching baselines and pre-trained models at yearly and monthly timescales while offering fast sampling, especially at coarser temporal levels. To scale SPF, we curate ClimateSuite, the largest collection of Earth system simulations to date, comprising over 33,000 simulation-years across ten climate models and the first dataset to include simulations of climate interventions. We find that the scaled SPF model demonstrates good generalization to held-out scenarios across climate models. Together, SPF and ClimateSuite provide a foundation for accurate, efficient, probabilistic climate emulation across temporal scales and realistic future scenarios. Data and code is publicly available at https://github.com/stanfordmlgroup/spf .
CYMar 23, 2023
Human Behavior in the Time of COVID-19: Learning from Big DataHanjia Lyu, Arsal Imtiaz, Yufei Zhao et al.
Since the World Health Organization (WHO) characterized COVID-19 as a pandemic in March 2020, there have been over 600 million confirmed cases of COVID-19 and more than six million deaths as of October 2022. The relationship between the COVID-19 pandemic and human behavior is complicated. On one hand, human behavior is found to shape the spread of the disease. On the other hand, the pandemic has impacted and even changed human behavior in almost every aspect. To provide a holistic understanding of the complex interplay between human behavior and the COVID-19 pandemic, researchers have been employing big data techniques such as natural language processing, computer vision, audio signal processing, frequent pattern mining, and machine learning. In this study, we present an overview of the existing studies on using big data techniques to study human behavior in the time of the COVID-19 pandemic. In particular, we categorize these studies into three groups - using big data to measure, model, and leverage human behavior, respectively. The related tasks, data, and methods are summarized accordingly. To provide more insights into how to fight the COVID-19 pandemic and future global catastrophes, we further discuss challenges and potential opportunities.
MTRL-SCIDec 19, 2025
QMBench: A Research Level Benchmark for Quantum Materials ResearchYanzhen Wang, Yiyang Jiang, Diana Golovanova et al.
We introduce QMBench, a comprehensive benchmark designed to evaluate the capability of large language model agents in quantum materials research. This specialized benchmark assesses the model's ability to apply condensed matter physics knowledge and computational techniques such as density functional theory to solve research problems in quantum materials science. QMBench encompasses different domains of the quantum material research, including structural properties, electronic properties, thermodynamic and other properties, symmetry principle and computational methodologies. By providing a standardized evaluation framework, QMBench aims to accelerate the development of an AI scientist capable of making creative contributions to quantum materials research. We expect QMBench to be developed and constantly improved by the research community.
LGJul 3, 2024
SFC: Achieve Accurate Fast Convolution under Low-precision ArithmeticLiulu He, Yufei Zhao, Rui Gao et al.
Fast convolution algorithms, including Winograd and FFT, can efficiently accelerate convolution operations in deep models. However, these algorithms depend on high-precision arithmetic to maintain inference accuracy, which conflicts with the model quantization. To resolve this conflict and further improve the efficiency of quantized convolution, we proposes SFC, a new algebra transform for fast convolution by extending the Discrete Fourier Transform (DFT) with symbolic computing, in which only additions are required to perform the transformation at specific transform points, avoiding the calculation of irrational number and reducing the requirement for precision. Additionally, we enhance convolution efficiency by introducing correction terms to convert invalid circular convolution outputs of the Fourier method into effective ones. The numerical error analysis is presented for the first time in this type of work and proves that our algorithms can provide a 3.68x multiplication reduction for 3x3 convolution, while the Winograd algorithm only achieves a 2.25x reduction with similarly low numerical errors. Experiments carried out on benchmarks and FPGA show that our new algorithms can further improve the computation efficiency of quantized models while maintaining accuracy, surpassing both the quantization-alone method and existing works on fast convolution quantization.
CVMay 7
SwiftI2V: Efficient High-Resolution Image-to-Video Generation via Conditional Segment-wise GenerationYaoYang Liu, Yuechen Zhang, Wenbo Li et al.
High-resolution image-to-video (I2V) generation aims to synthesize realistic temporal dynamics while preserving fine-grained appearance details of the input image. At 2K resolution, it becomes extremely challenging, and existing solutions suffer from various weaknesses: 1) end-to-end models are often prohibitively expensive in memory and latency; 2) cascading low-resolution generation with a generic video super-resolution tends to hallucinate details and drift from input-specific local structures, since the super-resolution stage is not explicitly conditioned on the input image. To this end, we propose SwiftI2V, an efficient framework tailored for high-resolution I2V. Following the widely used two-stage design, it addresses the efficiency--fidelity dilemma by first generating a low-resolution motion reference to reduce token costs and ease the modeling burden, then performing a strongly image-conditioned 2K synthesis guided by the motion to recover input-faithful details with controlled overhead. Specifically, to make generation more scalable, SwiftI2V introduces Conditional Segment-wise Generation (CSG) to synthesize videos segment-by-segment with a bounded per-step token budget, and adopts bidirectional contextual interaction within each segment to improve cross-segment coherence and input fidelity. On VBench-I2V at 2K resolution, SwiftI2V achieves performance comparable to end-to-end baselines while reducing total GPU-time by 202x. Particularly, it enables practical 2K I2V generation on a single datacenter GPU (e.g., H800) or consumer GPU (e.g., RTX 4090).
CVJul 16, 2024
MRIo3DS-Net: A Mutually Reinforcing Images to 3D Surface RNN-like framework for model-adaptation indoor 3D reconstructionChang Li, Jiao Guo, Yufei Zhao et al.
This paper is the first to propose an end-to-end framework of mutually reinforcing images to 3D surface recurrent neural network-like for model-adaptation indoor 3D reconstruction,where multi-view dense matching and point cloud surface optimization are mutually reinforced by a RNN-like structure rather than being treated as a separate issue.The characteristics are as follows:In the multi-view dense matching module, the model-adaptation strategy is used to fine-tune and optimize a Transformer-based multi-view dense matching DNN,so that it has the higher image feature for matching and detail expression capabilities;In the point cloud surface optimization module,the 3D surface reconstruction network based on 3D implicit field is optimized by using model-adaptation strategy,which solves the problem of point cloud surface optimization without knowing normal vector of 3D surface.To improve and finely reconstruct 3D surfaces from point cloud,smooth loss is proposed and added to this module;The MRIo3DS-Net is a RNN-like framework,which utilizes the finely optimized 3D surface obtained by PCSOM to recursively reinforce the differentiable warping for optimizing MVDMM.This refinement leads to achieving better dense matching results, and better dense matching results leads to achieving better 3D surface results recursively and mutually.Hence, model-adaptation strategy can better collaborate the differences between the two network modules,so that they complement each other to achieve the better effect;To accelerate the transfer learning and training convergence from source domain to target domain,a multi-task loss function based on Bayesian uncertainty is used to adaptively adjust the weights between the two networks loss functions of MVDMM and PCSOM;In this multi-task cascade network framework,any modules can be replaced by any state-of-the-art networks to achieve better 3D reconstruction results.
ITApr 26
DRL-Based Antenna Position Optimization For MA-Assisted OTFS System Under Imperfect CSIMaoyuan Wang, Qian Zhang, Yufei Zhao et al.
In this paper, we introduce movable antenna (MA) technology into orthogonal time frequency space (OTFS) systems to enable wavelength-level antenna position optimization under imperfect channel state information (CSI), thereby mitigating deep fading. To accurately acquire CSI, we develop a sparse Bayesian learning method with variational inference (SBLVI) method. Based on estimated CSI, we formulate an MA position optimization problem with the objective of maximizing channel gain. Due to the highly non-convex character of the problem, we further develop a deep reinforcement learning (DRL) strategy to intelligently optimize MA positions. Simulation results show that the proposed SBLVI method significantly improves channel estimation accuracy over benchmark methods, and MA position optimization based on estimated CSI achieves substantially higher channel gains than the fixed-position antenna (FPA), demonstrating the effectiveness of the proposed MA-assisted OTFS system.
CVJun 15, 2024
A Comprehensive Taxonomy and Analysis of Talking Head Synthesis: Techniques for Portrait Generation, Driving Mechanisms, and EditingMing Meng, Yufei Zhao, Bo Zhang et al.
Talking head synthesis, an advanced method for generating portrait videos from a still image driven by specific content, has garnered widespread attention in virtual reality, augmented reality and game production. Recently, significant breakthroughs have been made with the introduction of novel models such as the transformer and the diffusion model. Current methods can not only generate new content but also edit the generated material. This survey systematically reviews the technology, categorizing it into three pivotal domains: portrait generation, driven mechanisms, and editing techniques. We summarize milestone studies and critically analyze their innovations and shortcomings within each domain. Additionally, we organize an extensive collection of datasets and provide a thorough performance analysis of current methodologies based on various evaluation metrics, aiming to furnish a clear framework and robust data support for future research. Finally, we explore application scenarios of talking head synthesis, illustrate them with specific cases, and examine potential future directions.
AIMar 5, 2024
A general approach to enhance the survivability of backdoor attacks by decision path couplingYufei Zhao, Dingji Wang, Bihuan Chen et al.
Backdoor attacks have been one of the emerging security threats to deep neural networks (DNNs), leading to serious consequences. One of the mainstream backdoor defenses is model reconstruction-based. Such defenses adopt model unlearning or pruning to eliminate backdoors. However, little attention has been paid to survive from such defenses. To bridge the gap, we propose Venom, the first generic backdoor attack enhancer to improve the survivability of existing backdoor attacks against model reconstruction-based defenses. We formalize Venom as a binary-task optimization problem. The first is the original backdoor attack task to preserve the original attack capability, while the second is the attack enhancement task to improve the attack survivability. To realize the second task, we propose attention imitation loss to force the decision path of poisoned samples in backdoored models to couple with the crucial decision path of benign samples, which makes backdoors difficult to eliminate. Our extensive evaluation on two DNNs and three datasets has demonstrated that Venom significantly improves the survivability of eight state-of-the-art attacks against eight state-of-the-art defenses without impacting the capability of the original attacks.
CLOct 16, 2018
Large Language Models for Few-Shot Named Entity RecognitionYufei Zhao, Xiaoshi Zhong, Erik Cambria et al.
Named entity recognition (NER) is a fundamental task in numerous downstream applications. Recently, researchers have employed pre-trained language models (PLMs) and large language models (LLMs) to address this task. However, fully leveraging the capabilities of PLMs and LLMs with minimal human effort remains challenging. In this paper, we propose GPT4NER, a method that prompts LLMs to resolve the few-shot NER task. GPT4NER constructs effective prompts using three key components: entity definition, few-shot examples, and chain-of-thought. By prompting LLMs with these effective prompts, GPT4NER transforms few-shot NER, which is traditionally considered as a sequence-labeling problem, into a sequence-generation problem. We conduct experiments on two benchmark datasets, CoNLL2003 and OntoNotes5.0, and compare the performance of GPT4NER to representative state-of-the-art models in both few-shot and fully supervised settings. Experimental results demonstrate that GPT4NER achieves the $F_1$ of 83.15\% on CoNLL2003 and 70.37\% on OntoNotes5.0, significantly outperforming few-shot baselines by an average margin of 7 points. Compared to fully-supervised baselines, GPT4NER achieves 87.9\% of their best performance on CoNLL2003 and 76.4\% of their best performance on OntoNotes5.0. We also utilize a relaxed-match metric for evaluation and report performance in the sub-task of named entity extraction (NEE), and experiments demonstrate their usefulness to help better understand model behaviors in the NER task.