Frank Neffke

7.6 · GN · Apr 8

Using digital traces to analyze software work: skills, careers and programming languages

Xiangnan Feng, Johannes Wachs, Simone Daniotti et al.

Recent waves of technological transformation are reshaping work in uncertain and hard-to-predict ways. However, jobs at the forefront of the digitizing economy offer an early glimpse of these changes and leave rich activity traces. We exploit such traces in tens of millions of Question and Answer posts on Stack Overflow to create a fine-grained taxonomy of software skills and use it to analyze human capital in the global software industry. Constructing a software skill space that maps relations among these skills reveals that real-world software jobs demand highly coherent skill sets and that programmers learn through a process of related diversification. This process often leads to the acquisition of lower-value skills. However, when programmers use Python they preferentially target higher-value skills, offering a potential explanation for Python's successful rise as a dominant general-purpose language.

ME · May 3, 2021

What can the millions of random treatments in nonexperimental data reveal about causes?

Andre F. Ribeiro, Frank Neffke, Ricardo Hausmann

We propose a new method to estimate causal effects from nonexperimental data. Each pair of sample units is first associated with a stochastic 'treatment' (the differences in factors between the units) and an effect (the resulting outcome difference). We then propose that all such pairs can be combined to provide more accurate estimates of causal effects in observational data, given a statistical model that connects combinatorial properties of treatments to the accuracy and unbiasedness of their effects. The article introduces one such model and a Bayesian approach to combine the $O(n^2)$ pairwise observations typically available in nonexperimental data. This also leads to an interpretation of nonexperimental datasets as incomplete, or noisy, versions of ideal factorial experimental designs. This approach to causal effect estimation has several advantages: (1) it expands the number of observations, converting thousands of individuals into millions of observational treatments; (2) starting with treatments closest to the experimental ideal, it identifies noncausal variables that can be ignored in the future, making estimation easier in each subsequent iteration while departing minimally from experiment-like conditions; (3) it recovers individual causal effects in heterogeneous populations. We evaluate the method in simulations and on the National Supported Work (NSW) program, an intensively studied program whose effects are known from randomized field experiments. We demonstrate that the proposed approach recovers causal effects in common NSW samples, as well as in arbitrary subpopulations and an order-of-magnitude larger supersample with the entire national program data, outperforming statistical, econometric and machine learning estimators in all cases...
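The pair construction described in the abstract can be illustrated with a minimal Python sketch. It only builds the pairwise 'treatments' (factor differences) and effects (outcome differences) and flags pairs that differ in exactly one factor, used here as one simple reading of "closest to the experimental ideal". It does not implement the authors' statistical model or Bayesian combination step. The function names `pairwise_treatments` and `single_factor_pairs` and the toy data are assumptions made for illustration.

```python
from itertools import combinations

import numpy as np


def pairwise_treatments(X, y):
    """Return (pairs, treatments, effects) for all unordered pairs i < j.

    treatments[k] is the factor difference X[i] - X[j] and effects[k] is the
    outcome difference y[i] - y[j] for the k-th pair (i, j).
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # n units yield n * (n - 1) / 2 unordered pairs, i.e. O(n^2) observations.
    pairs = np.array(list(combinations(range(len(y)), 2)))
    i, j = pairs[:, 0], pairs[:, 1]
    return pairs, X[i] - X[j], y[i] - y[j]


def single_factor_pairs(treatments):
    """Boolean mask of pairs whose units differ in exactly one factor.

    Assumption for illustration: such pairs resemble a one-factor experiment.
    """
    return np.count_nonzero(treatments, axis=1) == 1


# Toy data (hypothetical): 4 units, 2 binary factors, one outcome.
X = [[0, 0], [1, 0], [0, 1], [1, 1]]
y = [1.0, 3.0, 1.5, 3.5]

pairs, treatments, effects = pairwise_treatments(X, y)
mask = single_factor_pairs(treatments)
print(len(pairs), "pairs,", int(mask.sum()), "differ in exactly one factor")
```

With four units the sketch produces six pairs, which shows how a sample of thousands of individuals grows into millions of pairwise observations.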

2 Papers