82.0LGApr 18Code
Self-Reinforcing Controllable Synthesis of Rare Relational Data via Bayesian CalibrationChongsheng Zhang, Hao Wang, Zelong Yu et al.
Imbalanced data is commonly present in real-world applications. While data synthesis can effectively mitigate the data scarcity problem of rare-classes, and LLMs have revolutionized text generation, the application of LLMs to relational/structured tabular data synthesis remains underexplored. Moreover, existing approaches lack an effective feedback mechanism that can guide LLMs towards continuously optimizing the quality of the generated data throughout the synthesis process. In this work, we propose RDDG, Relational Data generator with Dynamic Guidance, which is a unified in-context learning framework that employs progressive chain-of-thought (CoT) steps to generate tabular data for enhancing downstream imbalanced classification performance. RDDG first uses core set selection to identify representative samples from the original data, then utilizes in-context learning to discover the inherent patterns and correlations among attributes within the core set, and subsequently generates tabular data while preserving the aforementioned constraints. More importantly, it incorporates a self-reinforcing feedback mechanism that provides automatic assessments on the quality of the generated data, enabling continuous quality optimization throughout the generation process. Experimental results on multiple real and synthetic datasets demonstrate that RDDG outperforms existing approaches in both data fidelity and downstream imbalanced classification performance. We make our code available at https://github.com/cszhangLMU/RDDG.
LGAug 1, 2024
A Systematic Review on Long-Tailed LearningChongsheng Zhang, George Almpanidis, Gaojuan Fan et al.
Long-tailed data is a special type of multi-class imbalanced data with a very large amount of minority/tail classes that have a very significant combined influence. Long-tailed learning aims to build high-performance models on datasets with long-tailed distributions, which can identify all the classes with high accuracy, in particular the minority/tail classes. It is a cutting-edge research direction that has attracted a remarkable amount of research effort in the past few years. In this paper, we present a comprehensive survey of latest advances in long-tailed visual learning. We first propose a new taxonomy for long-tailed learning, which consists of eight different dimensions, including data balancing, neural architecture, feature enrichment, logits adjustment, loss function, bells and whistles, network optimization, and post hoc processing techniques. Based on our proposed taxonomy, we present a systematic review of long-tailed learning methods, discussing their commonalities and alignable differences. We also analyze the differences between imbalance learning and long-tailed learning approaches. Finally, we discuss prospects and future directions in this field.
CLAug 28, 2025Code
GUARD: Glocal Uncertainty-Aware Robust Decoding for Effective and Efficient Open-Ended Text GenerationYuanhao Ding, Esteban Garces Arias, Meimingwei Li et al.
Open-ended text generation faces a critical challenge: balancing coherence with diversity in LLM outputs. While contrastive search-based decoding strategies have emerged to address this trade-off, their practical utility is often limited by hyperparameter dependence and high computational costs. We introduce GUARD, a self-adaptive decoding method that effectively balances these competing objectives through a novel "Glocal" uncertainty-driven framework. GUARD combines global entropy estimates with local entropy deviations to integrate both long-term and short-term uncertainty signals. We demonstrate that our proposed global entropy formulation effectively mitigates abrupt variations in uncertainty, such as sudden overconfidence or high entropy spikes, and provides theoretical guarantees of unbiasedness and consistency. To reduce computational overhead, we incorporate a simple yet effective token-count-based penalty into GUARD. Experimental results demonstrate that GUARD achieves a good balance between text diversity and coherence, while exhibiting substantial improvements in generation speed. In a more nuanced comparison study across different dimensions of text quality, both human and LLM evaluators validated its remarkable performance. Our code is available at https://github.com/YecanLee/GUARD.
IRMay 4, 2025Code
Explainable Coarse-to-Fine Ancient Manuscript Duplicates DiscoveryChongsheng Zhang, Shuwen Wu, Yingqi Chen et al.
Ancient manuscripts are the primary source of ancient linguistic corpora. However, many ancient manuscripts exhibit duplications due to unintentional repeated publication or deliberate forgery. The Dead Sea Scrolls, for example, include counterfeit fragments, whereas Oracle Bones (OB) contain both republished materials and fabricated specimens. Identifying ancient manuscript duplicates is of great significance for both archaeological curation and ancient history study. In this work, we design a progressive OB duplicate discovery framework that combines unsupervised low-level keypoints matching with high-level text-centric content-based matching to refine and rank the candidate OB duplicates with semantic awareness and interpretability. We compare our model with state-of-the-art content-based image retrieval and image matching methods, showing that our model yields comparable recall performance and the highest simplified mean reciprocal rank scores for both Top-5 and Top-15 retrieval results, and with significantly accelerated computation efficiency. We have discovered over 60 pairs of new OB duplicates in real-world deployment, which were missed by domain experts for decades. Code, model and real-world results are available at: https://github.com/cszhangLMU/OBD-Finder/.
LGSep 1, 2021
An Empirical Study on the Joint Impact of Feature Selection and Data Re-sampling on Imbalance ClassificationChongsheng Zhang, Paolo Soda, Jingjun Bi et al.
In predictive tasks, real-world datasets often present different degrees of imbalanced (i.e., long-tailed or skewed) distributions. While the majority (the head) classes have sufficient samples, the minority (the tail) classes can be under-represented by a rather limited number of samples. Data pre-processing has been shown to be very effective in dealing with such problems. On one hand, data re-sampling is a common approach to tackling class imbalance. On the other hand, dimension reduction, which reduces the feature space, is a conventional technique for reducing noise and inconsistencies in a dataset. However, the possible synergy between feature selection and data re-sampling for high-performance imbalance classification has rarely been investigated before. To address this issue, we carry out a comprehensive empirical study on the joint influence of feature selection and re-sampling on two-class imbalance classification. Specifically, we study the performance of two opposite pipelines for imbalance classification by applying feature selection before or after data re-sampling. We conduct a large number of experiments, with a total of 9225 tests, on 52 publicly available datasets, using 9 feature selection methods, 6 re-sampling approaches for class imbalance learning, and 3 well-known classification algorithms. Experimental results show that there is no constant winner between the two pipelines; thus both of them should be considered to derive the best performing model for imbalance classification. We find that the performance of an imbalance classification model not only depends on the classifier adopted and the ratio between the number of majority and minority samples, but also depends on the ratio between the number of samples and features. Overall, this study should provide new reference value for researchers and practitioners in imbalance learning.