CVApr 16, 2024Code
Automated Evaluation of Large Vision-Language Models on Self-driving Corner CasesKai Chen, Yanze Li, Wenhua Zhang et al.
Large Vision-Language Models (LVLMs) have received widespread attention for advancing the interpretable self-driving. Existing evaluations of LVLMs primarily focus on multi-faceted capabilities in natural circumstances, lacking automated and quantifiable assessment for self-driving, let alone the severe road corner cases. In this work, we propose CODA-LM, the very first benchmark for the automatic evaluation of LVLMs for self-driving corner cases. We adopt a hierarchical data structure and prompt powerful LVLMs to analyze complex driving scenes and generate high-quality pre-annotations for the human annotators, while for LVLM evaluation, we show that using the text-only large language models (LLMs) as judges reveals even better alignment with human preferences than the LVLM judges. Moreover, with our CODA-LM, we build CODA-VLM, a new driving LVLM surpassing all open-sourced counterparts on CODA-LM. Our CODA-VLM performs comparably with GPT-4V, even surpassing GPT-4V by +21.42% on the regional perception task. We hope CODA-LM can become the catalyst to promote interpretable self-driving empowered by LVLMs.
LGJan 15
BPE: Behavioral Profiling EnsembleYanxin Liu, Yunqi Zhang
In the field of machine learning, ensemble learning is widely recognized as a pivotal strategy for pushing the boundaries of predictive performance. Traditional static ensemble methods typically assign weights by treating each base learner as a whole, thereby overlooking that individual models exhibit varying competence across different regions of the instance space. Dynamic Ensemble Selection (DES) was introduced to address this limitation. However, both static and dynamic approaches predominantly rely on inter-model differences as the basis for integration; this inter-model perspective neglects models' intrinsic characteristics and often requires heavy reliance on reference sets for competence estimation. We propose the Behavioral Profiling Ensemble (BPE) framework, which introduces a model-centric integration paradigm. Unlike traditional methods, BPE constructs an intrinsic behavioral profile $\mathcal{P}_k$ for each model and derives aggregation weights from the deviation between a model's response to a test instance and its established profile; in this work, we instantiate $\mathcal{P}_k$ with entropy-based summary statistics (e.g., mean and variance). Extensive experiments on 42 real-world datasets show that BPE-derived algorithms outperform state-of-the-art DES baselines, increasing predictive accuracy while reducing computational and storage overhead.
LGMay 12, 2025
Relative Overfitting and Accept-Reject FrameworkYanxin Liu, Yunqi Zhang
The scaling of Large Language Models (LLMs) currently faces significant challenges. Model assembly is widely considered a promising solution to break through these performance bottlenecks. However, current ensembling methods are primarily guided by the statistical expectation that combining multiple models over large samples will lead to performance gains. We propose an ensemble framework that transitions from such stochastic, sample-dependent methods to a regular, controllable approach based on fine-grained model segmentation. This regularity governs how models are segmented to ensure performance improvement, how the magnitude of this improvement varies with model selection, and what factors determine its theoretical maximum. To formalize this pattern, we introduce the concept of'relative overfitting,' which is derived from the performance discrepancies between constituent models and builds a bridge between ensemble outcomes and the inherent attributes of these models. We detail the patterns of this framework within the domain of NLP and briefly describe its extensibility to other fields, such as computer vision (CV) and AI for science. Our approach was validated using both custom-built and pre-trained mainstream models across diverse benchmarks, including language modeling, long-context tasks, and question-answering (QA). The results indicate that the ensemble rules we proposed are generally effective and that we provide a rigorous proof of these rules in certain experimental scenarios. The proposed framework offers a new perspective for understanding ensemble theory and provides a systematic approach to addressing the performance bottlenecks of LLMs.
MEAug 6, 2025
Accept-Reject LassoYanxin Liu, Yunqi Zhang
The Lasso method is known to exhibit instability in the presence of highly correlated features, often leading to an arbitrary selection of predictors. This issue manifests itself in two primary error types: the erroneous omission of features that lack a true substitutable relationship (falsely redundant features) and the inclusion of features with a true substitutable relationship (truly redundant features). Although most existing methods address only one of these challenges, we introduce the Accept-Reject Lasso (ARL), a novel approach that resolves this dilemma. ARL operationalizes an Accept-Reject framework through a fine-grained analysis of feature selection across data subsets. This framework is designed to partition the output of an ensemble method into beneficial and detrimental components through fine-grained analysis. The fundamental challenge for Lasso is that inter-variable correlation obscures the true sources of information. ARL tackles this by first using clustering to identify distinct subset structures within the data. It then analyzes Lasso's behavior across these subsets to differentiate between true and spurious correlations. For truly correlated features, which induce multicollinearity, ARL tends to select a single representative feature and reject the rest to ensure model stability. Conversely, for features linked by spurious correlations, which may vanish in certain subsets, ARL accepts those that Lasso might have incorrectly omitted. The distinct patterns arising from true versus spurious correlations create a divisible separation. By setting an appropriate threshold, our framework can effectively distinguish between these two phenomena, thereby maximizing the inclusion of informative variables while minimizing the introduction of detrimental ones. We illustrate the efficacy of the proposed method through extensive simulation and real-data experiments.