Simeon Sayer

LG
h-index: 1
7 papers
4 citations
Novelty: 24%
AI Score: 32

7 Papers

LG | Sep 16, 2024
From Bytes to Bites: Using Country Specific Machine Learning Models to Predict Famine

Salloni Kapoor, Simeon Sayer

Hunger crises are critical global issues affecting millions, particularly in low-income and developing countries. This research investigates how machine learning can be utilized to predict and inform decisions regarding famine and hunger crises. By leveraging a diverse set of variables (natural, economic, and conflict-related), three machine learning models (Linear Regression, XGBoost, and RandomForestRegressor) were employed to predict food consumption scores, a key indicator of household nutrition. The RandomForestRegressor emerged as the most accurate model, with an average prediction error of 10.6%, though accuracy varied significantly across countries, ranging from 2% to over 30%. Notably, economic indicators were consistently the most significant predictors of average household nutrition, while no single feature dominated across all regions, underscoring the necessity for comprehensive data collection and tailored, country-specific models. These findings highlight the potential of machine learning, particularly Random Forests, to enhance famine prediction, suggesting that continued research and improved data gathering are essential for more effective global hunger forecasting.
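
The per-country spread in prediction error described above can be illustrated with a small sketch. It assumes that "average prediction error" means mean absolute percentage error, which the abstract does not confirm; the function name `mape_by_country` and all data values are hypothetical and are not taken from the paper.

```python
from collections import defaultdict


def mape_by_country(records):
    """Return {country: mean absolute percentage error} for
    (country, actual, predicted) rows."""
    errors = defaultdict(list)
    for country, actual, predicted in records:
        if actual == 0:
            continue  # percentage error is undefined for a zero target
        errors[country].append(abs(predicted - actual) / abs(actual) * 100.0)
    return {country: sum(vals) / len(vals) for country, vals in errors.items()}


# Hypothetical food consumption scores, invented for illustration only.
records = [
    ("A", 50.0, 49.0),
    ("A", 40.0, 40.8),
    ("B", 30.0, 39.0),
    ("B", 20.0, 14.0),
]

per_country = mape_by_country(records)
# Unweighted mean across countries; a weighted mean is an equally valid choice.
overall = sum(per_country.values()) / len(per_country)
```

Reporting the per-country values alongside the overall mean makes it visible when a moderate average hides a few countries with much larger errors.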

LG | Sep 17, 2024
Machine Learning for Public Good: Predicting Urban Crime Patterns to Enhance Community Safety

Sia Gupta, Simeon Sayer

In recent years, urban safety has become a paramount concern for city planners and law enforcement agencies. Accurate prediction of likely crime occurrences can significantly enhance preventive measures and resource allocation. However, many law enforcement departments lack the tools to analyze and apply advanced AI and ML techniques that can help city planners, watch programs, and safety leaders take proactive steps toward overall community safety. This paper explores the effectiveness of ML techniques in predicting spatial and temporal patterns of crime in urban areas. Using police dispatch call data from San Jose, CA, the research aims to achieve a high degree of accuracy in categorizing calls into priority levels, particularly for more dangerous situations that require an immediate law enforcement response. This categorization is informed by the time, place, and nature of the call. The research steps include data extraction, preprocessing, feature engineering, exploratory data analysis, and the implementation, optimization, and tuning of different supervised machine learning models and neural networks. Accuracy and precision are examined for different models and features at varying granularity of crime categories and location precision. The results demonstrate that, compared with a variety of other models, Random Forest classification models are most effective in identifying dangerous situations and their corresponding priority levels with high accuracy (Accuracy = 85%, AUC = 0.92) at a local level while keeping the number of false negatives to a minimum. While further research and data gathering are needed to include other social and economic factors, these results provide valuable insights for law enforcement agencies to optimize resources, develop proactive deployment approaches, and adjust response patterns to enhance overall public safety outcomes in an unbiased way.

20.2 | CY | Mar 19
Hidden Signals in Language: Inferring Sensitive Attributes from Reddit Comments Using Machine Learning

Anay Agarwalla, Simeon Sayer

Sensitive attributes are legally protected characteristics that should not be used to discriminate. Careful steps have been taken to minimize the risk of human bias regarding these fields, such as race and age. Large language models (LLMs) are similarly trained not to attempt to infer these aspects. However, just because they shouldn't, doesn't mean they don't. Using chat-like text fragments from authors tagged with sensitive attributes (e.g., MBTI personality, country of origin, gender), a model can often classify these attributes better than a naive guess, with results depending on the combination of subject matter and attribute. The text data from these comments is converted into numerical representations using embedding models, which are then used to train relatively simple classifiers such as logistic regression and decision trees. This study's results show that even these lightweight models can detect statistically significant signals associated with sensitive attributes in user-generated text. The results show that demographic traits such as gender and age are more readily predictable, whereas personality traits are expressed more subtly and depend more heavily on context. Predictive performance varies across online Reddit communities, with some subreddits consistently revealing attributes, while others show high variability depending on the trait being analyzed. These findings indicate that language contains latent identity signals that users may not intend to disclose but are nevertheless detectable through computational methods, and imply that more complex language models may have an inherent, greater capacity to infer sensitive attributes. This raises important concerns about privacy, bias, and the potential misuse of inferred personal information in AI systems. We call for increased transparency, stronger safeguards, and careful policy consideration for future LLMs.
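
The comparison against a "naive guess" can be sketched as follows. This assumes the naive guess is a majority-class baseline, which the abstract does not specify; the function names, labels, and predictions are hypothetical and only illustrate how a classifier's lift over that baseline might be measured.

```python
from collections import Counter


def majority_baseline_accuracy(labels):
    """Accuracy obtained by always guessing the most common label."""
    counts = Counter(labels)
    return counts.most_common(1)[0][1] / len(labels)


def accuracy(labels, predictions):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for y, p in zip(labels, predictions) if y == p)
    return correct / len(labels)


# Hypothetical attribute labels and classifier outputs, invented for illustration.
labels = ["a", "a", "a", "b", "b", "a", "b", "a"]
predictions = ["a", "a", "b", "b", "b", "a", "b", "a"]

baseline = majority_baseline_accuracy(labels)
model_acc = accuracy(labels, predictions)
lift = model_acc - baseline  # positive lift means the classifier beats the naive guess
```

Comparing against the majority class rather than a uniform random guess matters when attribute labels are imbalanced, since a classifier can otherwise appear informative simply by predicting the dominant class.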

CV | Nov 2, 2024
Optimizing Violence Detection in Video Classification Accuracy through 3D Convolutional Neural Networks

Aarjav Kavathia, Simeon Sayer

As violent crimes continue to happen, it becomes necessary to have security cameras that can rapidly identify moments of violence with high accuracy. The purpose of this study is to identify how many frames should be analyzed at a time to optimize a violence detection model's accuracy, treating the frame count as the depth parameter of a 3D convolutional network. Previous violence classification models have been created, but their application to live footage may be flawed. In this project, a convolutional neural network was created to analyze optical flow frames of each video. The number of frames analyzed at a time was varied across one, two, three, ten, and twenty frames, and each model was trained for 20 epochs. The greatest validation accuracy was 94.87% and occurred with the model that analyzed three frames at a time. This suggests that, for this dataset, machine learning models for violence detection may function better when analyzing three frames at a time. The methodology used to identify the optimal number of frames to analyze at a time could be used in other applications of video classification, especially those involving complex or abstract actions, such as violence.
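
Grouping frames into fixed-depth clips, as the abstract describes, might look like the sketch below. It assumes non-overlapping windows by default and discards any trailing frames that do not fill a clip; the abstract does not say how clips were formed, and `make_clips` is a hypothetical helper, not the authors' code.

```python
def make_clips(frames, depth, stride=None):
    """Group a frame sequence into clips of `depth` consecutive frames.

    With the default stride (equal to depth) the clips do not overlap.
    Trailing frames that cannot fill a whole clip are dropped.
    """
    if depth < 1:
        raise ValueError("depth must be at least 1")
    stride = depth if stride is None else stride
    if stride < 1:
        raise ValueError("stride must be at least 1")
    return [frames[i:i + depth] for i in range(0, len(frames) - depth + 1, stride)]


# Stand-in for 20 optical flow frames; integers are used instead of image arrays.
frames = list(range(20))

# The frame counts compared in the abstract: 1, 2, 3, 10, and 20.
clips_by_depth = {d: make_clips(frames, d) for d in (1, 2, 3, 10, 20)}
```

Each clip would then be stacked along the depth axis of the network input, so a larger depth yields fewer training samples per video, which is one factor to keep in mind when comparing accuracy across depths.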

CL | Sep 3, 2025
Artificially Fluent: Swahili AI Performance Benchmarks Between English-Trained and Natively-Trained Datasets

Sophie Jaffer, Simeon Sayer

As large language models (LLMs) expand multilingual capabilities, questions remain about the equity of their performance across languages. While many communities stand to benefit from AI systems, the dominance of English in training data risks disadvantaging non-English speakers. To test the hypothesis that such data disparities may affect model performance, this study compares two monolingual BERT models: one trained and tested entirely on Swahili data, and another on comparable English news data. To simulate how multilingual LLMs process non-English queries through internal translation and abstraction, we translated the Swahili news data into English and evaluated it using the English-trained model. This approach tests the hypothesis by evaluating whether translating Swahili inputs for evaluation on an English model yields better or worse performance than training and testing a model entirely in Swahili, thus isolating the effect of language consistency versus cross-lingual abstraction. The results show that, despite high-quality translation, the native Swahili-trained model performed better than the Swahili-to-English translated model, producing roughly four times fewer errors: 0.36% vs. 1.47%, respectively. This gap suggests that translation alone does not bridge representational differences between languages and that models trained in one language may struggle to accurately interpret translated inputs due to imperfect internal knowledge representation, so native-language training remains important for reliable outcomes. In educational and informational contexts, even small performance gaps may compound inequality. Future research should focus on broader dataset development for underrepresented languages and on renewed attention to multilingual model evaluation, so that global AI deployment does less to reinforce existing digital divides.

LG | Feb 21, 2025
News Sentiment as a Predictor for American Domestic Migration

Benjamin Lane, Simeon Sayer

This paper examines the effect that US news sentiment from national newspapers has on US interstate migration trends. Using data from the New York Times between 2010 and 2020, an average sentiment score was calculated, allowing the data to be entered into a neural network. Then a logistic regression model was used to predict interstate migration. The results indicate the model was highly accurate, as the mean margin of error was +/- 900 citizens. The predictions from the model were compared with the US Census data from 2010 to 2020 that was used to train the model. Since the input for the model was not exposed to any migration data, the model clearly demonstrated that its results were drawn from sentiment data alone. These findings are significant as they indicate that the press could serve as a predictor of domestic migration, which can help the government and businesses better understand what influences people to move to certain places.

LG | Jan 3, 2025
Exploring Equality: An Investigation into Custom Loss Functions for Fairness Definitions

Gordon Lee, Simeon Sayer

This paper explores the complex tradeoffs between predictive accuracy and various fairness metrics, such as equalized odds, disparate impact, and equal opportunity, within COMPAS by building neural networks trained with custom loss functions optimized to specific fairness criteria. This paper creates the first fairness-driven implementation of the novel Group Accuracy Parity (GAP) framework, as theoretically proposed by Gupta et al. (2024), and applies it to COMPAS. To operationalize and accurately compare the fairness of COMPAS models optimized to differing fairness ideals, this paper develops and proposes a combinatory analytical procedure that incorporates Pareto front and multivariate analysis, leveraging data visualizations such as violin graphs. This paper concludes that GAP achieves a better balance between fairness and accuracy than COMPAS's current nationwide implementation and alternative implementations of COMPAS optimized to more traditional fairness definitions. While this paper's algorithmic improvements to COMPAS significantly augment its fairness, external biases undermine the fairness of its implementation. Practices such as predictive policing and issues such as the lack of transparency regarding COMPAS's internal workings have contributed to the algorithm's historical injustice. In conjunction with developments regarding COMPAS's predictive methodology, legal and institutional changes must happen for COMPAS's just deployment.
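
The fairness metrics named in the abstract can be sketched using their common textbook definitions. This is an assumption: the paper's exact formulations, its custom loss functions, and the GAP framework are not reproduced here. The function names, group labels, and data are hypothetical, and a "positive" prediction is treated generically rather than as a favorable or unfavorable outcome.

```python
def rate(values):
    """Mean of a list of 0/1 values, or 0.0 for an empty list."""
    return sum(values) / len(values) if values else 0.0


def group_metrics(y_true, y_pred, group, g):
    """Selection rate, true positive rate, and false positive rate for group g."""
    rows = [(t, p) for t, p, gr in zip(y_true, y_pred, group) if gr == g]
    selection = rate([p for _, p in rows])
    tpr = rate([p for t, p in rows if t == 1])
    fpr = rate([p for t, p in rows if t == 0])
    return selection, tpr, fpr


def fairness_report(y_true, y_pred, group, unprivileged, privileged):
    """Common textbook forms of three group fairness metrics."""
    sel_u, tpr_u, fpr_u = group_metrics(y_true, y_pred, group, unprivileged)
    sel_p, tpr_p, fpr_p = group_metrics(y_true, y_pred, group, privileged)
    return {
        # Ratio of positive prediction rates between the two groups.
        "disparate_impact": sel_u / sel_p if sel_p else float("inf"),
        # Gap in true positive rates.
        "equal_opportunity_gap": abs(tpr_u - tpr_p),
        # Larger of the true positive and false positive rate gaps.
        "equalized_odds_gap": max(abs(tpr_u - tpr_p), abs(fpr_u - fpr_p)),
    }


# Hypothetical binary labels and predictions for two groups, invented for illustration.
y_true = [1, 1, 0, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 0, 1, 1, 1, 0]
group = ["u"] * 4 + ["p"] * 4

report = fairness_report(y_true, y_pred, group, unprivileged="u", privileged="p")
```

A fairness-aware loss would typically add a weighted, differentiable version of one of these gaps to the ordinary prediction loss; which gap is penalized determines which fairness definition the trained model favors.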