Cross-Ecosystem Vulnerability Analysis for Python Applications
This work addresses a critical security issue for Python developers and users by improving the accuracy of vulnerability detection, although the contribution is incremental because it builds on existing analysis techniques.
The paper tackles the problem of accurately identifying vulnerabilities in Python applications that depend on native libraries. It develops a provenance-aware analysis approach that resolves vendored libraries to specific OS package versions or upstream releases. The evaluation identifies 39 directly vulnerable packages (47M+ monthly downloads) and 312 indirectly vulnerable client packages, with up to 97% false positive reduction compared to upstream version matching.
Python applications depend on native libraries that may be vendored within package distributions or installed on the host system. When vulnerabilities are discovered in these libraries, determining which Python packages are affected requires cross-ecosystem analysis spanning Python dependency graphs and OS package versions. Current vulnerability scanners produce false negatives by missing vendored vulnerabilities and false positives by ignoring security patches backported by OS distributions. We present a provenance-aware vulnerability analysis approach that resolves vendored libraries to specific OS package versions or upstream releases. Our approach queries vendored libraries against a database of historical OS package artifacts using content-based hashing, and applies library-specific dynamic analyses to extract version information from binaries built from upstream source. We then construct cross-ecosystem call graphs by stitching together Python and binary call graphs across dependency boundaries, enabling reachability analysis of vulnerable functions. Evaluating on 100,000 Python packages and 10 known CVEs associated with third-party native dependencies, we identify 39 directly vulnerable packages (47M+ monthly downloads) and 312 indirectly vulnerable client packages affected through dependency chains. Our analysis achieves up to 97% false positive reduction compared to upstream version matching.
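The two mechanisms described in the abstract, content-based lookup of vendored libraries and reachability over a stitched call graph, can be illustrated with a minimal sketch. This is not the authors' implementation. The choice of SHA-256, the index layout, the edge-list representation, and every function and node name below (`build_index`, `resolve_provenance`, `stitch`, `reachable`, `libfoo`) are assumptions made for illustration only.

```python
import hashlib
from collections import deque


def content_hash(data: bytes) -> str:
    # Assumption: SHA-256 over the raw file bytes stands in for the
    # content-based hash; the abstract does not name a hash function.
    return hashlib.sha256(data).hexdigest()


def build_index(artifacts):
    # artifacts: iterable of (file_bytes, provenance) pairs, where provenance
    # is a hypothetical (package_name, distro_version) tuple.
    return {content_hash(blob): provenance for blob, provenance in artifacts}


def resolve_provenance(vendored: bytes, index):
    # Returns the recorded provenance, or None when the vendored file does
    # not match any known OS package artifact (e.g. built from upstream).
    return index.get(content_hash(vendored))


def stitch(python_edges, binary_edges, boundary_edges):
    # Merge per-ecosystem call graphs into one adjacency map. boundary_edges
    # connect Python callers to the native symbols they invoke.
    graph = {}
    for caller, callee in [*python_edges, *binary_edges, *boundary_edges]:
        graph.setdefault(caller, set()).add(callee)
    return graph


def reachable(graph, entry, target) -> bool:
    # Breadth-first search from an entry point to a vulnerable function.
    seen = {entry}
    queue = deque([entry])
    while queue:
        node = queue.popleft()
        if node == target:
            return True
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False
```

In this sketch, a vendored file that hashes to a known distribution artifact inherits that artifact's patch level, which is how a backported fix could be distinguished from the unpatched upstream release. A package would then be flagged only if the vulnerable native function is reachable from its Python entry points in the stitched graph.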