Adaptive Discrete Smoothing for High-Dimensional and Nonlinear Panel Data
This work addresses the challenge of modeling complex, heterogeneous panel data for applications such as transportation demand prediction, though the contribution is incremental in that it builds on existing smoothing and clustering methods.
The paper tackles the problem of estimating high-dimensional and nonlinear panel data models by developing a data-driven smoothing technique: each individual's function is estimated using weighted observations from other individuals, with weights that reflect how similar those individuals are. In simulations the estimator greatly improves prediction, and on a real-world transportation dataset it outperforms existing linear panel data estimators in out-of-sample prediction.
In this paper we develop a data-driven smoothing technique for high-dimensional and nonlinear panel data models. We allow for individual-specific (nonlinear) functions and estimate them with econometric or machine learning methods, using weighted observations from other individuals. The weights are determined in a data-driven way: they depend on the similarity between the corresponding functions, measured from initial estimates. The key feature of the procedure is that it clusters individuals based on the distance or similarity between them, estimated in a first stage. Our estimation method can be combined with various statistical estimation procedures, in particular modern machine learning methods, which are especially fruitful in the high-dimensional case and with complex, heterogeneous data. The approach can be interpreted as "soft clustering", in contrast to traditional "hard clustering", which assigns each individual to exactly one group. We conduct a simulation study which shows that prediction can be greatly improved by using our estimator. Finally, we analyze a large data set from didichuxing.com, a leading company in the transportation industry, to predict the gap between supply and demand based on a large set of covariates. Our estimator performs clearly better in out-of-sample prediction than existing linear panel data estimators.
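The two-stage idea described above can be illustrated with a minimal sketch. It is not the paper's implementation and rests on several assumptions that the abstract does not specify: individual functions are taken to be linear so that ordinary least squares can stand in for the first-stage and second-stage learners, similarity is measured by the Euclidean distance between first-stage coefficient vectors, and weights come from a Gaussian kernel with a hypothetical `bandwidth` parameter. All function names are illustrative.

```python
import numpy as np


def fit_individual_ols(X, y):
    """First stage: a separate OLS fit per individual.

    X has shape (N, T, p) and y has shape (N, T). Returns an (N, p) array.
    OLS is a stand-in here; any learner could supply the initial estimates.
    """
    return np.stack(
        [np.linalg.lstsq(X[i], y[i], rcond=None)[0] for i in range(X.shape[0])]
    )


def soft_cluster_weights(first_stage, bandwidth=1.0):
    """Row-normalised similarity weights from first-stage estimates.

    Assumption: a Gaussian kernel on squared Euclidean distance. Individuals
    with similar initial estimates get weights close to each other; distant
    ones get weights near zero, which gives a "soft" group assignment.
    """
    diffs = first_stage[:, None, :] - first_stage[None, :, :]
    sq_dist = (diffs ** 2).sum(axis=2)
    kernel = np.exp(-sq_dist / (2.0 * bandwidth ** 2))
    return kernel / kernel.sum(axis=1, keepdims=True)


def smoothed_estimates(X, y, bandwidth=1.0):
    """Second stage: weighted least squares for each individual on pooled data."""
    N, T, p = X.shape
    beta_init = fit_individual_ols(X, y)
    W = soft_cluster_weights(beta_init, bandwidth)

    # Pool all observations; rows are ordered individual by individual.
    X_pool = X.reshape(N * T, p)
    y_pool = y.reshape(N * T)

    beta = np.empty((N, p))
    for i in range(N):
        # The weight of individual j applies to all T of its observations.
        sqrt_w = np.sqrt(np.repeat(W[i], T))
        beta[i] = np.linalg.lstsq(
            X_pool * sqrt_w[:, None], y_pool * sqrt_w, rcond=None
        )[0]
    return beta, W
```

In this sketch a very small `bandwidth` pushes each row of the weight matrix toward putting all mass on the individual itself (separate estimation), while a very large one approaches equal weights (full pooling). How the bandwidth or the weights should be chosen in practice is not covered by the abstract, so this tuning is left open here.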