A Multi-level Analysis of Factors Associated with Student Performance: A Machine Learning Approach to the SAEB Microdata
This study gives policymakers a data-driven tool for addressing educational equity by targeting systemic disparities between schools. Its contribution is incremental, however, because it applies existing methods to new data.
This study identifies factors associated with student performance in Brazil by training a multi-level machine learning model on SAEB microdata. A Random Forest model achieved 90.2% accuracy and 96.7% AUC, and the analysis revealed that school socioeconomic level is the dominant predictor.
Identifying the factors that influence student performance in basic education is a central challenge for formulating effective public policies in Brazil. This study introduces a multi-level machine learning approach to classify the proficiency of 9th-grade and high school students using microdata from the System of Assessment of Basic Education (SAEB). Our model integrates four data sources: student socioeconomic characteristics, teacher professional profiles, school indicators, and principal management profiles. In a comparative analysis of four ensemble algorithms, a Random Forest model performed best, achieving 90.2% accuracy and an Area Under the Curve (AUC) of 96.7%. To move beyond prediction, we applied Explainable AI (XAI) using SHAP (SHapley Additive exPlanations), which revealed that the school's average socioeconomic level is the dominant predictor, indicating that systemic factors contribute more to the model's predictions than individual characteristics in isolation. The primary conclusion is that academic performance is a systemic phenomenon deeply tied to the school's ecosystem. This study provides a data-driven, interpretable tool to inform policies aimed at promoting educational equity by addressing disparities between schools.
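The multi-level integration described above can be illustrated with a minimal sketch. It assumes that the four sources share a school identifier and that teacher records are averaged per school before being attached to each student row; the abstract does not state the actual join keys or aggregation rules, and every field name and value below is hypothetical toy data rather than a SAEB column or result.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical toy records; field names are illustrative, not SAEB column names.
students = [
    {"student_id": 1, "school_id": "A", "ses": 0.8},
    {"student_id": 2, "school_id": "A", "ses": 0.6},
    {"student_id": 3, "school_id": "B", "ses": 0.2},
]
teachers = [
    {"school_id": "A", "years_teaching": 10},
    {"school_id": "A", "years_teaching": 20},
    {"school_id": "B", "years_teaching": 4},
]
schools = {
    "A": {"infrastructure": 0.9},
    "B": {"infrastructure": 0.4},
}
principals = {
    "A": {"principal_years": 12},
    "B": {"principal_years": 3},
}


def group_mean(rows, key, value):
    """Average `value` over all rows that share the same `key`."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row[value])
    return {k: mean(v) for k, v in groups.items()}


def build_feature_rows(students, teachers, schools, principals):
    """Attach school-level context to every student record."""
    # School-level aggregate of the student socioeconomic indicator.
    school_mean_ses = group_mean(students, "school_id", "ses")
    # Assumed aggregation: mean teacher experience per school.
    mean_teacher_years = group_mean(teachers, "school_id", "years_teaching")

    rows = []
    for student in students:
        school_id = student["school_id"]
        row = dict(student)  # student-level features
        row["school_mean_ses"] = school_mean_ses[school_id]
        row["mean_teacher_years"] = mean_teacher_years[school_id]
        row.update(schools[school_id])      # school indicators
        row.update(principals[school_id])   # principal profile
        rows.append(row)
    return rows


feature_rows = build_feature_rows(students, teachers, schools, principals)
```

Each resulting row carries both individual and school-level features, which is what would allow a classifier and a SHAP analysis to compare a student's own socioeconomic indicator against the school average.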