Understanding the Challenges and Assisting Developers with Developing Spark Applications
This work addresses the debugging difficulties that developers face with big data frameworks such as Spark, but its contribution is incremental: it builds on existing tools and focuses on a specific domain.
The paper tackles the difficulty developers have in understanding and debugging data processing code in Apache Spark. It reports an empirical study of 1,000 Stack Overflow questions, which finds that most challenges relate to data transformation and API usage, and it proposes an approach that uses statistical sampling to provide intermediate information and hint messages. A preliminary evaluation reports low performance overhead and positive developer feedback.
To process data more efficiently, big data frameworks provide data abstractions to developers. However, these abstractions can make it difficult for developers to understand and debug data processing code. To uncover the challenges in using big data frameworks, we first conduct an empirical study of 1,000 Apache Spark-related questions on Stack Overflow. We find that most of the challenges are related to data transformation and API usage. To address these challenges, we design an approach that assists developers in understanding and debugging data processing in Spark. Our approach leverages statistical sampling to minimize performance overhead, and it provides intermediate information and hint messages for each data processing step of a chained method pipeline. A preliminary evaluation shows that the approach has low performance overhead, and it received positive feedback from developers.
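The idea of sampling the input and reporting intermediate information for each step of a chained pipeline can be illustrated with a minimal sketch. This is plain Python over in-memory lists, not Spark code and not the authors' tool. The function name `trace_pipeline`, the report fields, the example steps, and the single hint rule (flag a step that drops every sampled record) are all assumptions made for illustration.

```python
import random


def trace_pipeline(records, steps, sample_size=100, seed=0):
    """Run each named step on a random sample and report intermediate results.

    Hypothetical helper: `steps` is a list of (name, function) pairs, where
    each function maps an iterable of records to an iterable of records.
    """
    rng = random.Random(seed)
    # Sample without replacement; use the whole input when it is small.
    if len(records) <= sample_size:
        current = list(records)
    else:
        current = rng.sample(records, sample_size)

    report = []
    for name, step in steps:
        before = len(current)
        current = list(step(current))
        # Assumed hint rule: warn when a step removes every sampled record.
        hint = None
        if before > 0 and len(current) == 0:
            hint = f"step '{name}' dropped every sampled record"
        report.append({
            "step": name,
            "in": before,
            "out": len(current),
            "preview": current[:3],  # a few intermediate records to inspect
            "hint": hint,
        })
    return report


# Hypothetical three-step chain; the last filter can never match the input.
steps = [
    ("parse", lambda rows: (int(r) for r in rows)),
    ("keep_even", lambda rows: (x for x in rows if x % 2 == 0)),
    ("keep_negative", lambda rows: (x for x in rows if x < 0)),
]
report = trace_pipeline([str(i) for i in range(1000)], steps, sample_size=50, seed=42)
for entry in report:
    print(entry)
```

Because every step runs only on the sample, the cost of tracing is bounded by `sample_size` rather than by the size of the full dataset, which is the intuition behind using sampling to keep overhead low. The trade-off is that a sample can miss rare records, so a hint derived from it is a prompt to investigate rather than a proof of a bug.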