CRAI
May 28, 2025

Aurora: Are Android Malware Classifiers Reliable and Stable under Distribution Shift?

arXiv:2505.22843v2
h-index: 2
Originality: Incremental advance
AI Analysis

This paper addresses the operational reliability of malware classifiers, a practical concern for cybersecurity practitioners, and offers an incremental improvement in how such classifiers are evaluated.

The paper asks whether Android malware classifiers maintain reliable confidence estimates under distribution shift. It finds that state-of-the-art frameworks are fragile across datasets, which suggests that their design and evaluation need to be revisited.

The performance figures of modern drift-adaptive malware classifiers appear promising, but does this translate to genuine operational reliability? The standard evaluation paradigm primarily focuses on baseline performance metrics, neglecting confidence-error alignment and operational stability. While TESSERACT established the importance of temporal evaluation, we take a complementary direction by investigating whether malware classifiers maintain reliable and stable confidence estimates under distribution shifts and exploring the tensions between scientific advancement and practical impacts when they do not. We propose AURORA, a framework to evaluate malware classifiers based on their confidence quality and operational resilience. AURORA subjects the confidence profile of a given model to verification to assess the reliability of its estimates. Unreliable confidence estimates erode operational trust, waste valuable annotation budget on non-informative samples for active learning, and leave error-prone instances undetected in selective classification. AURORA is complemented by a set of metrics designed to go beyond point-in-time performance, striving towards a more holistic assessment of operational stability throughout temporal evaluation periods. The fragility in SOTA frameworks across datasets of varying drift suggests the need for a return to the whiteboard.
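To make "confidence-error alignment" and "selective classification" concrete, here is a minimal sketch of two generic checks. These are standard textbook measures, not the metrics AURORA defines, and the function names `expected_calibration_error` and `selective_risk` are hypothetical. The first measures how far stated confidence drifts from observed accuracy; the second measures the error rate left over when the classifier only answers on its most confident samples.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted mean gap between average confidence and accuracy per bin.

    confidences: floats in [0, 1], the model's confidence in its prediction.
    correct: 1 if the prediction was right, 0 otherwise.
    """
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 falls into the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(conf for conf, _ in members) / len(members)
        accuracy = sum(ok for _, ok in members) / len(members)
        ece += (len(members) / n) * abs(avg_conf - accuracy)
    return ece


def selective_risk(confidences, correct, coverage):
    """Error rate on the most confident `coverage` fraction of samples."""
    ranked = sorted(zip(confidences, correct), key=lambda p: p[0], reverse=True)
    kept = ranked[:max(1, round(coverage * len(ranked)))]
    return 1.0 - sum(ok for _, ok in kept) / len(kept)


if __name__ == "__main__":
    # Illustrative values only, not data from the paper.
    confidences = [0.95, 0.90, 0.85, 0.80, 0.60, 0.55]
    correct = [1, 1, 0, 1, 0, 1]
    print("ECE:", expected_calibration_error(confidences, correct))
    print("Risk at 50% coverage:", selective_risk(confidences, correct, 0.5))
```

If confidence tracks correctness, selective risk should fall as coverage shrinks. A classifier whose risk stays flat or rises when it abstains on low-confidence samples is leaving error-prone instances undetected, which is the failure mode the abstract describes.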

