LG Mar 3
Rethinking Time Series Domain Generalization via Structure-Stratified Calibration
Jinyang Li, Shuhao Mei, Xiaoyu Xiao et al.
For time series arising from latent dynamical systems, existing cross-domain generalization methods commonly assume that samples are comparably meaningful within a shared representation space. In real-world settings, however, different datasets often originate from structurally heterogeneous families of dynamical systems, leading to fundamentally distinct feature distributions. Under such circumstances, performing global alignment while neglecting structural differences is highly prone to establishing spurious correspondences and inducing negative transfer. We revisit this problem from the new perspective of cross-domain structural correspondence failure and propose a structure-stratified calibration framework (SSCF). This approach explicitly identifies structurally consistent samples and performs amplitude calibration exclusively within structurally compatible sample clusters, thereby alleviating generalization failures caused by structural incompatibility. Notably, the proposed framework achieves substantial performance improvements through a concise and computationally efficient calibration strategy. Evaluations on 19 public datasets (100.3k samples) demonstrate that SSCF significantly outperforms strong baselines under the zero-shot setting. These results confirm that establishing structural consistency prior to alignment constitutes a more reliable and effective pathway for improving cross-domain generalization of time series governed by latent dynamical systems.
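The idea of calibrating amplitudes only within structurally compatible clusters can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say how clusters are found or what form the calibration takes, so the cluster IDs below are assumed to be given, z-scoring stands in for "amplitude calibration", and the function name `calibrate_within_clusters` is hypothetical.

```python
from statistics import mean, pstdev


def calibrate_within_clusters(series, cluster_ids):
    """Z-score each series using statistics pooled over its own cluster only.

    Assumption: cluster_ids already encode structural compatibility; how
    they are obtained is outside the scope of this sketch.
    """
    calibrated = [None] * len(series)
    for cid in set(cluster_ids):
        members = [i for i, c in enumerate(cluster_ids) if c == cid]
        pooled = [x for i in members for x in series[i]]
        mu, sigma = mean(pooled), pstdev(pooled)
        scale = sigma if sigma > 0 else 1.0  # guard against constant clusters
        for i in members:
            calibrated[i] = [(x - mu) / scale for x in series[i]]
    return calibrated


# Toy, made-up data: two low-amplitude series in cluster 0 and one
# high-amplitude series in cluster 1.
toy_series = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [100.0, 200.0, 300.0]]
toy_clusters = [0, 0, 1]
calibrated = calibrate_within_clusters(toy_series, toy_clusters)
```

Because each cluster is rescaled with its own statistics, the high-amplitude series does not distort the scale of the low-amplitude ones, which is the failure a single global normalization would risk.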
NE Jan 5, 2025
LLMs Help Alleviate the Cross-Subject Variability in Brain Signal and Language Alignment
Yifei Liu, Hengwei Ye, Shuhang Li
Decoding human activity from EEG signals has long been a popular research topic. While recent studies have increasingly shifted focus from single-subject to cross-subject analysis, few have explored the model's ability to perform zero-shot predictions on EEG signals from previously unseen subjects. This research aims to investigate whether deep learning methods can capture subject-independent semantic information inherent in human EEG signals. Such insights are crucial for Brain-Computer Interfaces (BCI) because, on one hand, they demonstrate the model's robustness against subject-specific temporal biases, and on the other, they significantly enhance the generalizability of downstream tasks. We employ Large Language Models (LLMs) as denoising agents to extract subject-independent semantic features from noisy EEG signals. Experimental results, including ablation studies, highlight the pivotal role of LLMs in decoding subject-independent semantic information from noisy EEG data. We hope our findings will contribute to advancing BCI research and assist both academia and industry in applying EEG signals to a broader range of applications.
LG Aug 13, 2025
FM4NPP: A Scaling Foundation Model for Nuclear and Particle Physics
David Park, Shuhang Li, Yi Huang et al.
Large language models have revolutionized artificial intelligence by enabling large, generalizable models trained through self-supervision. This paradigm has inspired the development of scientific foundation models (FMs). However, applying this capability to experimental particle physics is challenging due to the sparse, spatially distributed nature of detector data, which differs dramatically from natural language. This work addresses whether an FM for particle physics can scale and generalize across diverse tasks. We introduce a new dataset with more than 11 million particle collision events, together with a suite of downstream tasks and labeled data for evaluation. We propose a novel self-supervised training method for detector data and demonstrate its neural scalability with models that feature up to 188 million parameters. With frozen weights and task-specific adapters, this FM consistently outperforms baseline models across all downstream tasks. The performance also exhibits robust data-efficient adaptation. Further analysis reveals that the representations extracted by the FM are task-agnostic but can be specialized via a single linear mapping for different downstream tasks.
INS-DET Nov 18, 2024
Variable Rate Neural Compression for Sparse Detector Data
Yi Huang, Yeonju Go, Jin Huang et al.
High-energy, large-scale particle colliders generate data at extraordinary rates. Developing real-time, high-throughput data compression algorithms to reduce data volume and meet the bandwidth requirement for storage has become increasingly critical. Deep learning is a promising technology for addressing this challenge. At the newly constructed sPHENIX experiment at the Relativistic Heavy Ion Collider, a Time Projection Chamber (TPC) serves as the main tracking detector, recording three-dimensional particle trajectories within a gas-filled cylindrical volume. The resulting data flow can be very sparse, with occupancy reaching $10^{-3}$ for proton-proton collisions. Such sparsity presents a challenge to conventional learning-free lossy compression algorithms, such as SZ, ZFP, and MGARD. In contrast, emerging deep learning-based models, particularly those utilizing convolutional neural networks for compression, have outperformed these conventional methods in terms of compression ratios and reconstruction accuracy. However, research on the efficacy of these deep learning models in handling sparse datasets, like those produced in particle colliders, remains limited. Furthermore, most deep learning models do not adapt their processing speeds to data sparsity, which affects efficiency. To address this issue, we propose a novel approach for TPC data compression via key-point identification facilitated by sparse convolution. Our proposed algorithm, BCAE-VS, achieves a $75\%$ improvement in reconstruction accuracy with a $10\%$ increase in compression ratio over the previous state-of-the-art model. Additionally, BCAE-VS achieves these results with a model size over two orders of magnitude smaller. Lastly, we have experimentally verified that as sparsity increases, so does the model's throughput.
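To make the occupancy figure concrete, the sketch below computes occupancy as the fraction of voxels with a nonzero reading. The grid shape, the hit value, and the function name `occupancy` are illustrative assumptions, not details of the sPHENIX TPC readout or of BCAE-VS.

```python
def occupancy(voxels):
    """Return the fraction of voxels in a nested 3D list that are nonzero."""
    flat = [v for plane in voxels for row in plane for v in row]
    return sum(1 for v in flat if v != 0) / len(flat)


# Toy, made-up 10 x 10 x 10 grid containing a single hit.
grid = [[[0] * 10 for _ in range(10)] for _ in range(10)]
grid[2][3][4] = 57  # arbitrary nonzero reading
occ = occupancy(grid)  # one hit in 1000 voxels
```

At an occupancy near $10^{-3}$, a dense convolution spends almost all of its work on empty voxels, which is the motivation the abstract gives for sparse convolution.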
CR Nov 27, 2019
PacketCGAN: Exploratory Study of Class Imbalance for Encrypted Traffic Classification Using CGAN
Pan Wang, Shuhang Li, Feng Ye et al.
With the growing adoption of Deep Learning (DL) in image processing, computer vision, and NLP, researchers have begun to apply DL to encrypted traffic classification problems. Although these methods can automatically extract traffic features, overcoming the feature-engineering difficulty of traditional classification methods such as DPI, they need a large amount of data to learn the characteristics of various types of traffic. Therefore, the performance of a classification model depends significantly on the quality of its datasets. Building datasets, however, is a time-consuming and costly task, especially for encrypted traffic data. It is often more difficult to collect a large number of traffic samples for unpopular encrypted applications than for well-known ones, which leads to class imbalance between major and minor encrypted applications in datasets. In this paper, we propose a novel traffic data augmentation method called PacketCGAN, based on Conditional GAN (CGAN). As a generative model, PacketCGAN exploits the benefit of CGAN to generate specified traffic and thereby address dataset imbalance. As a proof of concept, three classical DL models, such as the Convolutional Neural Network (CNN), were adopted and designed to classify four encrypted traffic datasets augmented by Random Over Sampling (ROS), SMOTE (Synthetic Minority Over-sampling Technique), vanilla GAN, and PacketCGAN, respectively, based on two public datasets: ISCX2012 and USTC-TFC2016. The experimental results demonstrate that a DL-based encrypted traffic classifier using the dataset augmented by PacketCGAN achieves better performance than the others.
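Of the augmentation baselines named above, Random Over Sampling is the simplest to sketch: minority-class samples are duplicated at random until every class matches the largest one. The example below is a generic illustration of that technique, not code from the paper; the function name `random_over_sample` and the toy flow labels are assumptions.

```python
import random
from collections import Counter


def random_over_sample(samples, labels, seed=0):
    """Duplicate randomly chosen samples of smaller classes until every
    class has as many samples as the largest class."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_samples, out_labels = list(samples), list(labels)
    for cls, n in counts.items():
        pool = [s for s, y in zip(samples, labels) if y == cls]
        extra = [rng.choice(pool) for _ in range(target - n)]
        out_samples.extend(extra)
        out_labels.extend([cls] * len(extra))
    return out_samples, out_labels


# Toy, made-up data: four flows from a popular application, one from a rare one.
flows = ["a1", "a2", "a3", "a4", "b1"]
apps = ["major", "major", "major", "major", "minor"]
balanced_flows, balanced_apps = random_over_sample(flows, apps)
```

ROS only repeats existing samples, so it adds no new variation to the minority class; generative approaches such as the CGAN-based method described in the abstract aim to synthesize new samples instead.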