19.4 CR May 15 Code
An Overview of Cyber Security Funding for Open Source Software
Jukka Ruohonen, Gaurav Choudhary, Adam Alami
Many open source software (OSS) projects need more human resources for maintenance, improvements, and sometimes even their survival. These needs allegedly apply even to vital OSS projects that can be seen as being a part of the world's critical infrastructures. To address this resourcing problem, new funding instruments for OSS projects have been established in recent years. The paper examines two such funding bodies for OSS and the projects they have funded. The focus of both funding bodies is on software security and cyber security in general. Based on qualitative thematic analysis, the results indicate that particularly OSS supply chains, network and cryptography libraries, programming languages, and operating systems and their low-level components have been funded and thus seen as critical in terms of cyber security. In addition to the qualitative results presented, the paper makes a contribution by connecting the research branches of critical infrastructure and sustainability of OSS projects. A further contribution is made by connecting the topic examined to recent cyber security regulations. Finally, an important argument is raised that neither cyber security nor project sustainability alone can entirely explain the rationales behind the funding decisions made by the two funding bodies.
18.5 SE May 7
Empirical Derivations from an Evolving Test Suite
Jukka Ruohonen, Abhishek Tiwari
The paper presents a longitudinal empirical analysis of the automated, continuous, and virtualization-based software test suite of the NetBSD operating system. The longitudinal period observed spans from the initial rollout of the test suite in the early 2010s to late 2025. According to the results, the test suite has grown continuously, currently covering over ten thousand individual test cases. Failed test cases exhibit overall stability, although there have been shorter periods marked with more frequent failures. A similar observation applies to build failures, failures of the test suite to complete, and installation failures, all of which are also captured by NetBSD's testing framework. Finally, code churn and kernel modifications do not provide longitudinally consistent statistical explanations for the failures. Although some periods exhibit larger effects, particularly with respect to the kernel modifications, the effects are small on average. Even though only in an exploratory manner, these empirical observations contribute to efforts to draw conclusions from large-scale and evolving software test suites.
SE Jul 27, 2021 Code
A Large-Scale Security-Oriented Static Analysis of Python Packages in PyPI
Jukka Ruohonen, Kalle Hjerppe, Kalle Rindell
Different security issues are a common problem for open source packages archived to and delivered through software ecosystems. These often manifest themselves as software weaknesses that may lead to concrete software vulnerabilities. This paper examines various security issues in Python packages with static analysis. The dataset is based on a snapshot of all packages stored in the Python Package Index (PyPI). In total, over 197 thousand packages and over 749 thousand security issues are covered. Even under the constraints imposed by static analysis, (a) the results indicate prevalence of security issues; at least one issue is present for about 46% of the Python packages. In terms of the issue types, (b) exception handling and different code injections have been the most common issues. The subprocess module stands out in this regard. Reflecting the generally small size of the packages, (c) software size metrics do not predict the number of issues revealed through static analysis well. With these results and the accompanying discussion, the paper contributes to the field of large-scale empirical studies for better understanding security problems in software ecosystems.
SE Jul 24, 2020 Code
A Case Study on Software Vulnerability Coordination
Jukka Ruohonen, Sampsa Rauti, Sami Hyrynsalmi et al.
Context: Coordination is a fundamental tenet of software engineering. Coordination is required also for identifying discovered and disclosed software vulnerabilities with Common Vulnerabilities and Exposures (CVEs). Motivated by recent practical challenges, this paper examines the coordination of CVEs for open source projects through a public mailing list. Objective: The paper observes the historical time delays between the assignment of CVEs on a mailing list and the later appearance of these in the National Vulnerability Database (NVD). Drawing from research on software engineering coordination, software vulnerabilities, and bug tracking, the delays are modeled through three dimensions: social networks and communication practices, tracking infrastructures, and the technical characteristics of the CVEs coordinated. Method: Given a period between 2008 and 2016, a sample of over five thousand CVEs is used to model the delays with nearly fifty explanatory metrics. Regression analysis is used for the modeling. Results: The results show that the CVE coordination delays are affected by different abstractions for noise and prerequisite constraints. These abstractions convey effects from the social network and infrastructure dimensions. Particularly strong effect sizes are observed for annual and monthly control metrics, a control metric for weekends, the degrees of the nodes in the CVE coordination networks, and the number of references given in NVD for the CVEs archived. Smaller but visible effects are present for metrics measuring the entropy of the emails exchanged, traces to bug tracking systems, and other related aspects. The empirical signals are weaker for the technical characteristics. Conclusion: [...]
SE Sep 5, 2019 Code
Empirical Notes on the Interaction Between Continuous Kernel Fuzzing and Development
Jukka Ruohonen, Kalle Rindell
Fuzzing has been studied and applied ever since the 1990s. Automated and continuous fuzzing has recently been applied also to open source software projects, including the Linux and BSD kernels. This paper concentrates on the practical aspects of continuous kernel fuzzing in four open source kernels. According to the results, there are over 800 unresolved crashes reported for the four kernels by the syzkaller/syzbot framework. Many of these were reported relatively long ago. Interestingly, fuzzing-induced bugs have been resolved in the BSD kernels more rapidly. Furthermore, assertions and debug checks, use-after-frees, and general protection faults account for the majority of bug types in the Linux kernel. About 23% of the fixed bugs in the Linux kernel have gone through either code review or additional testing. Finally, only code churn provides a weak statistical signal for explaining the associated bug fixing times in the Linux kernel.
CR Jun 25, 2021
Crossing Cross-Domain Paths in the Current Web
Jukka Ruohonen, Joonas Salovaara, Ville Leppänen
The loading of resources from third parties has evoked new security and privacy concerns about the current world wide web. Building on the concepts of forced and implicit trust, this paper examines cross-domain transmission control protocol (TCP) connections that are initiated to domains other than the domain queried with a web browser. The dataset covers nearly ten thousand domains and over three hundred thousand TCP connections initiated by querying popular Finnish websites and globally popular sites. According to the results, (i) cross-domain connections are extremely common in the current Web. (ii) Most of these transmit encrypted content, although mixed content delivery is relatively common; many of the cross-domain connections deliver unencrypted content at the same time. (iii) Many of the cross-domain connections are initiated to known web advertisement domains, but a much larger share traces to social media platforms and cloud infrastructures. Finally, (iv) the results differ slightly between the Finnish websites sampled and the globally popular sites. With these results, the paper contributes to the ongoing work for better understanding cross-domain connections and dependencies in the world wide web.
SE Apr 30, 2020
Extracting Layered Privacy Language Purposes from Web Services
Kalle Hjerppe, Jukka Ruohonen, Ville Leppänen
Web services are important in the processing of personal data in the World Wide Web. In light of recent data protection regulations, this processing raises a question about consent or other basis of legal processing. While consent must be informed, many web services fail to provide enough information for users to make informed decisions. Privacy policies and privacy languages are one way of addressing this problem; the former document how personal data is processed, while the latter describe this processing formally. In this paper, the so-called Layered Privacy Language (LPL) is coupled with web services in order to express personal data processing with a formal analysis method that seeks to generate the processing purposes for privacy policies. To this end, the paper reviews the background theory as well as proposes a method and a concrete tool. The results are demonstrated with a small case study.
SE Mar 22, 2020
Annotation-Based Static Analysis for Personal Data Protection
Kalle Hjerppe, Jukka Ruohonen, Ville Leppänen
This paper elaborates on the use of static source code analysis in the context of data protection. The topic is important for software engineering in order for software developers to improve the protection of personal data during software development. To this end, the paper proposes a design of annotating classes and functions that process personal data. The design serves two primary purposes: on one hand, it provides means for software developers to document their intent; on the other hand, it furnishes tools for automatic detection of potential violations. This dual rationale facilitates compliance with the General Data Protection Regulation (GDPR) and other emerging data protection and privacy regulations. In addition to a brief review of the state of the art of static analysis in the data protection context and the design of the proposed analysis method, a concrete tool is presented to demonstrate a practical implementation for the Java programming language.
SE Jul 17, 2019
The General Data Protection Regulation: Requirements, Architectures, and Constraints
Kalle Hjerppe, Jukka Ruohonen, Ville Leppänen
The General Data Protection Regulation (GDPR) in the European Union is the most famous recently enacted privacy regulation. Despite the regulation's legal, political, and technological ramifications, relatively little research has been carried out for better understanding the GDPR's practical implications for requirements engineering and software architectures. Building on a grounded theory approach with close ties to the Finnish software industry, this paper contributes to the sealing of this gap in previous research. Three questions are asked and answered in the context of software development organizations. First, the paper elaborates nine practical constraints under which many small and medium-sized enterprises (SMEs) often operate when implementing solutions that address the new regulatory demands. Second, the paper elicits nine regulatory requirements from the GDPR for software architectures. Third, the paper presents an implementation for a software architecture that complies both with the requirements elicited and the constraints elaborated.
SE Dec 13, 2018
A Demand-Side Viewpoint to Software Vulnerabilities in WordPress Plugins
Jukka Ruohonen
WordPress has long been the most popular content management system (CMS). This CMS powers millions and millions of websites. Although WordPress has had a particularly bad track record in terms of security, in recent years many of the well-known security risks have transmuted from the core WordPress to the numerous plugins and themes written for the CMS. Given this background, the paper analyzes known software vulnerabilities discovered from WordPress plugins. A demand-side viewpoint was used to motivate the analysis; the basic hypothesis is that plugins with large installation bases have been affected by multiple vulnerabilities. As the hypothesis also holds according to the empirical results, the paper contributes to the recent discussion about common security folklore. A few general insights are also provided about the relation between software vulnerabilities and software maintenance.
SE Oct 31, 2018
An Empirical Analysis of Vulnerabilities in Python Packages for Web Applications
Jukka Ruohonen
This paper examines software vulnerabilities in common Python packages used particularly for web development. The empirical dataset is based on the PyPI package repository and the so-called Safety DB used to track vulnerabilities in selected packages within the repository. The methodological approach builds on a release-based time series analysis of the conditional probabilities for the releases of the packages to be vulnerable. According to the results, many of the Python vulnerabilities observed seem to be only modestly severe; input validation and cross-site scripting have been the most typical vulnerabilities. In terms of the time series analysis based on the release histories, only the recent past is observed to be relevant for statistical predictions; the classical Markov property holds.
CR Sep 15, 2018
On the Integrity of Cross-Origin JavaScripts
Jukka Ruohonen, Joonas Salovaara, Ville Leppänen
The same-origin policy is a fundamental part of the Web. Despite the restrictions imposed by the policy, embedding of third-party JavaScript code is allowed and commonly used. Nothing is guaranteed about the integrity of such code. To tackle this deficiency, solutions such as the subresource integrity standard have been recently introduced. Given this background, this paper presents the first empirical study on the temporal integrity of cross-origin JavaScript code. According to the empirical results based on a ten-day polling period of over 35 thousand scripts collected from popular websites, (i) temporal integrity changes are relatively common; (ii) the adoption of the subresource integrity standard is still in its infancy; and (iii) it is possible to statistically predict whether a temporal integrity change is likely to occur. With these results and the accompanying discussion, the paper contributes to the ongoing attempts to better understand security and privacy in the current Web.
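As a rough illustration of what checking temporal integrity can involve, the sketch below computes a subresource integrity value for a script body and compares two polled snapshots of the same script. The function names and the snapshot-comparison setup are assumptions made for illustration; they are not taken from the paper.

```python
import base64
import hashlib

# Hash algorithms that the subresource integrity standard allows.
ALLOWED = {"sha256", "sha384", "sha512"}


def sri_digest(content: bytes, algorithm: str = "sha384") -> str:
    """Return an integrity value of the form '<algorithm>-<base64 digest>'."""
    if algorithm not in ALLOWED:
        raise ValueError(f"unsupported algorithm: {algorithm}")
    digest = hashlib.new(algorithm, content).digest()
    return f"{algorithm}-{base64.b64encode(digest).decode('ascii')}"


def integrity_changed(previous: bytes, current: bytes) -> bool:
    """Hypothetical helper: did the script body change between two polls?"""
    return sri_digest(previous) != sri_digest(current)


if __name__ == "__main__":
    # Two made-up snapshots of the same cross-origin script.
    day_one = b"console.log('v1');"
    day_two = b"console.log('v2');"
    print(sri_digest(day_one))
    print(integrity_changed(day_one, day_two))
```

A page that pins a script with the `integrity` attribute would refuse to run the second snapshot, which is why frequent temporal changes and low adoption are related observations.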
IR Sep 5, 2018
Toward Validation of Textual Information Retrieval Techniques for Software Weaknesses
Jukka Ruohonen, Ville Leppänen
This paper presents a preliminary validation of common textual information retrieval techniques for mapping unstructured software vulnerability information to distinct software weaknesses. The validation is carried out with a dataset compiled from four software repositories tracked in the Snyk vulnerability database. According to the results, the information retrieval techniques used perform unsatisfactorily compared to regular expression searches. Although the results vary from one repository to another, the preliminary validation presented indicates that explicit referencing of vulnerability and weakness identifiers is preferable for concrete vulnerability tracking. Such referencing allows the use of keyword-based searches, which currently seem to yield more consistent results compared to information retrieval techniques. Further validation work is required for improving the precision of the techniques, however.
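To make the regular expression baseline concrete, a minimal sketch follows that extracts explicitly referenced CVE and CWE identifiers from free text. The patterns and the function name are assumptions chosen for illustration; the paper's actual expressions are not given in the abstract.

```python
import re

# CVE identifiers: "CVE-", a four-digit year, and at least four more digits.
CVE_PATTERN = re.compile(r"\bCVE-\d{4}-\d{4,}\b", re.IGNORECASE)
# CWE identifiers: "CWE-" followed by a numeric weakness id.
CWE_PATTERN = re.compile(r"\bCWE-\d+\b", re.IGNORECASE)


def extract_identifiers(text: str) -> dict:
    """Return the distinct CVE and CWE identifiers mentioned in the text."""
    return {
        "cve": sorted({match.upper() for match in CVE_PATTERN.findall(text)}),
        "cwe": sorted({match.upper() for match in CWE_PATTERN.findall(text)}),
    }


if __name__ == "__main__":
    # A made-up advisory snippet, not taken from any real database entry.
    advisory = "Fixes CVE-2018-1000001 (CWE-79); see also cwe-89."
    print(extract_identifiers(advisory))
```

A search of this kind only works when advisories reference identifiers explicitly, which is the practice the abstract recommends.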
CR May 24, 2018
A Bug Bounty Perspective on the Disclosure of Web Vulnerabilities
Jukka Ruohonen, Luca Allodi
Bug bounties have become increasingly popular in recent years. This paper discusses bug bounties by framing these theoretically against the so-called platform economy. Empirically, the interest is in the disclosure of web vulnerabilities through the Open Bug Bounty (OBB) platform between 2015 and late 2017. According to the empirical results based on a dataset covering nearly 160 thousand web vulnerabilities, (i) OBB has been successful as a community-based platform for the dissemination of web vulnerabilities. The platform has also attracted many productive hackers, (ii) but there exists a large productivity gap, which likely relates to (iii) a knowledge gap and the use of automated tools for web vulnerability discovery. While the platform (iv) has been exceptionally fast to evaluate new vulnerability submissions, (v) the patching times of the web vulnerabilities disseminated have been long. With these empirical results and the accompanying theoretical discussion, the paper contributes to the small but rapidly growing amount of research on bug bounties. In addition, the paper makes a practical contribution by discussing the business models behind bug bounties from the viewpoints of platforms, ecosystems, and vulnerability markets.
CR May 16, 2018
Investigating the Agility Bias in DNS Graph Mining
Jukka Ruohonen, Ville Leppänen
The concept of agile domain name system (DNS) refers to dynamic and rapidly changing mappings between domain names and their Internet protocol (IP) addresses. This empirical paper evaluates the bias from this kind of agility for DNS-based graph theoretical data mining applications. By building on two conventional metrics for observing malicious DNS agility, the agility bias is observed by comparing bipartite DNS graphs to different subgraphs from which vertices and edges are removed according to two criteria. According to an empirical experiment with two longitudinal DNS datasets, irrespective of the criterion, the agility bias is observed to be severe particularly regarding the effect of outlying domains hosted and delivered via content delivery networks and cloud computing services. With these observations, the paper contributes to the research domains of cyber security and DNS mining. In a larger context of applied graph mining, the paper further elaborates the practical concerns related to the learning of large and dynamic bipartite graphs.
CR Apr 20, 2018
An Empirical Survey on the Early Adoption of DNS Certification Authority Authorization
Jukka Ruohonen
A new certification authority authorization (CAA) resource record for the domain name system (DNS) was standardized in 2013. Motivated by the later 2017 decision to enforce mandatory CAA checking for most certificate authorities, this paper surveys the early adoption of CAA by using an empirical sample collected from Alexa's top-million domains. According to the results, (i) the adoption of CAA is still at a modest level; only a little below two percent of the popular domains sampled have adopted CAA. Among the domains that have adopted CAA, (ii) authorizations dealing with wildcard certificates are rare compared to conventional certificates. Interestingly, (iii) the results only partially reflect the market structure of the global certificate business. With these timely results, the paper contributes to the ongoing large-scale empirical research on the use of encryption technologies.
CR Jan 23, 2018
Whose Hands Are in the Finnish Cookie Jar?
Jukka Ruohonen, Ville Leppänen
Web cookies are ubiquitously used to track and profile the behavior of users. Although there is a solid empirical foundation for understanding the use of cookies in the global world wide web, thus far, limited attention has been devoted to country-specific and company-level analysis of cookies. To patch this limitation in the literature, this paper investigates persistent third-party cookies used in the Finnish web. The exploratory results reveal some similarities and interesting differences between the Finnish and the global web. In particular, popular Finnish websites are mostly owned by media companies, which have established their distinct partnerships with online advertisement companies. The results reported can also be reflected against current and future privacy regulation in the European Union.
CR Jan 3, 2018
A Look at the Time Delays in CVSS Vulnerability Scoring
Jukka Ruohonen
This empirical paper examines the time delays that occur between the publication of Common Vulnerabilities and Exposures (CVEs) in the National Vulnerability Database (NVD) and the Common Vulnerability Scoring System (CVSS) information attached to published CVEs. According to the empirical results based on regularized regression analysis of over eighty thousand archived vulnerabilities, (i) the CVSS content does not statistically influence the time delays, which, however, (ii) are strongly affected by a decreasing annual trend. In addition to these results, the paper contributes to the empirical research tradition of software vulnerabilities by a couple of insights on misuses of statistical methodology.
SE Oct 16, 2017
How PHP Releases Are Adopted in the Wild?
Jukka Ruohonen, Ville Leppänen
This empirical paper examines the adoption of PHP releases in the contemporary world wide web. Motivated by continuous software engineering practices and software traceability improvements for release engineering, the empirical analysis is based on big data collected by web crawling. According to the empirical results based on discrete time-homogeneous Markov chain (DTMC) analysis, (i) adoption of PHP releases has been relatively uniform across the domains observed, (ii) which tend to also adopt either old or new PHP releases relatively infrequently. Although there are outliers, (iii) downgrading of PHP releases is generally rare. To some extent, (iv) the results vary between the recent history from 2016 to early 2017 and the long-run evolution in the 2010s. In addition to these empirical results, the paper contributes to the software evolution and release engineering research traditions by elaborating the applied use of DTMCs for systematic empirical tracing of online software deployments.
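For readers unfamiliar with DTMC analysis, the sketch below estimates a transition probability matrix from observed release sequences by counting consecutive pairs and normalizing each row. The data layout (one list of observed PHP release labels per domain, ordered in time) and the function name are assumptions made for illustration, not the paper's actual pipeline.

```python
from collections import Counter, defaultdict


def transition_matrix(sequences):
    """Estimate DTMC transition probabilities from state sequences.

    Each sequence lists the states observed for one domain at
    consecutive crawl times. Returns {source: {target: probability}}.
    """
    counts = defaultdict(Counter)
    for sequence in sequences:
        # Count every observed move between consecutive time steps.
        for source, target in zip(sequence, sequence[1:]):
            counts[source][target] += 1
    # Normalize each row so that its probabilities sum to one.
    return {
        source: {target: n / sum(row.values()) for target, n in row.items()}
        for source, row in counts.items()
    }


if __name__ == "__main__":
    # Made-up observations for three hypothetical domains.
    observations = [
        ["5.6", "5.6", "7.0"],
        ["5.6", "7.0", "7.0"],
        ["7.0", "7.0", "5.6"],  # a downgrade in the last step
    ]
    for source, row in transition_matrix(observations).items():
        print(source, row)
```

In such a matrix, rare downgrades would appear as small off-diagonal probabilities from newer to older releases.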
CR Oct 16, 2017
Classifying Web Exploits with Topic Modeling
Jukka Ruohonen
This short empirical paper investigates how well topic modeling and database meta-data characteristics can classify web and other proof-of-concept (PoC) exploits for publicly disclosed software vulnerabilities. By using a dataset comprising over 36 thousand PoC exploits, an accuracy rate of nearly 0.9 is obtained in the empirical experiment. Text mining and topic modeling are a significant boost factor behind this classification performance. In addition to these empirical results, the paper contributes to the research tradition of enhancing software vulnerability information with text mining, providing also a few scholarly observations about the potential for semi-automatic classification of exploits in the existing tracking infrastructures.