SE Jul 10, 2023
Unmasking the giant: A comprehensive evaluation of ChatGPT's proficiency in coding algorithms and data structures
Sayed Erfan Arefin, Tasnia Ashrafi Heya, Hasan Al-Qudah et al.
The transformative influence of Large Language Models (LLMs) is profoundly reshaping the Artificial Intelligence (AI) technology domain. Notably, ChatGPT distinguishes itself within these models, demonstrating remarkable performance in multi-turn conversations and exhibiting code proficiency across an array of languages. In this paper, we carry out a comprehensive evaluation of ChatGPT's coding capabilities based on what is to date the largest catalog of coding challenges. Our focus is on the Python programming language and problems centered on data structures and algorithms, two topics at the very foundations of Computer Science. We evaluate ChatGPT for its ability to generate correct solutions to the problems fed to it, its code quality, and the nature of the run-time errors thrown by its code. Where ChatGPT code successfully executes but fails to solve the problem at hand, we look into patterns in the test cases passed in order to gain some insight into how wrong ChatGPT code is in these kinds of situations. To infer whether ChatGPT might have directly memorized some of the data that was used to train it, we methodically design an experiment to investigate this phenomenon. Making comparisons with human performance whenever feasible, we investigate all the above questions in the context of both its underlying learning models (GPT-3.5 and GPT-4), across a vast array of sub-topics within the main topics, and on problems of varying degrees of difficulty.
DB Feb 12
Vibe Coding on Trial: Operating Characteristics of Unanimous LLM Juries
Muhammad Aziz Ullah, Abdul Serwadda
Large Language Models (LLMs) are now good enough at coding that developers can describe intent in plain language and let the tool produce the first code draft, a workflow increasingly built into tools like GitHub Copilot, Cursor, and Replit. What is missing is a reliable way to tell which model-written queries are safe to accept without sending everything to a human. We study the application of an LLM jury to run this review step. We first benchmark 15 open models on 82 MySQL text-to-SQL tasks using an execution-grounded protocol to get a clean baseline of which models are strong. From the six best models we build unanimous committees of sizes 1 through 6 that see the prompt, schema, and candidate SQL and accept the candidate only when every member says it is correct. This rule matches safety-first deployments where false accepts are more costly than false rejects. We measure true positive rate, false positive rate, and Youden's J, and we also look at committees per generator. Our results show that single-model judges are uneven, that small unanimous committees of strong models can cut false accepts while still passing many good queries, and that the exact committee composition matters significantly.
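The unanimous acceptance rule and the three reported metrics can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the judge names, the votes, and the ground-truth labels are invented toy data, not results from the paper, and the function names are hypothetical. Youden's J is computed as TPR minus FPR.

```python
def unanimous_accept(votes):
    """Accept a candidate query only if every committee member voted 'correct'."""
    return all(votes)


def operating_point(verdicts, truth):
    """Return (TPR, FPR, Youden's J) for accept/reject verdicts against ground truth."""
    tp = sum(1 for v, t in zip(verdicts, truth) if v and t)
    fp = sum(1 for v, t in zip(verdicts, truth) if v and not t)
    positives = sum(1 for t in truth if t)
    negatives = len(truth) - positives
    tpr = tp / positives if positives else 0.0
    fpr = fp / negatives if negatives else 0.0
    return tpr, fpr, tpr - fpr


# Toy data (hypothetical): True means the candidate SQL is actually correct.
truth = [True, True, True, False, False, False]

# Toy per-judge votes (hypothetical): True means the judge says "correct".
judge_votes = {
    "judge_a": [True, True, True, True, False, False],
    "judge_b": [True, True, False, False, True, False],
    "judge_c": [True, True, True, False, False, True],
}


def committee_verdicts(members):
    """Apply the unanimous rule across the named judges for every candidate."""
    return [
        unanimous_accept(judge_votes[m][i] for m in members)
        for i in range(len(truth))
    ]


single = operating_point(committee_verdicts(["judge_a"]), truth)
pair = operating_point(committee_verdicts(["judge_a", "judge_b"]), truth)
print("judge_a alone:", single)
print("judge_a + judge_b:", pair)
```

In this toy data, adding a second judge removes the single false accept but also rejects one correct query, which is the trade-off a unanimous rule makes: the false positive rate can only stay the same or fall as members are added, and so can the true positive rate.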
CR Aug 22, 2025
Guarding Your Conversations: Privacy Gatekeepers for Secure Interactions with Cloud-Based AI Models
GodsGift Uzor, Hasan Al-Qudah, Ynes Ineza et al.
The interactive nature of Large Language Models (LLMs), which closely track user data and context, has prompted users to share personal and private information in unprecedented ways. Even when users opt out of allowing their data to be used for training, these privacy settings offer limited protection when LLM providers operate in jurisdictions with weak privacy laws, invasive government surveillance, or poor data security practices. In such cases, the risk of sensitive information, including Personally Identifiable Information (PII), being mishandled or exposed remains high. To address this, we propose the concept of an "LLM gatekeeper", a lightweight, locally run model that filters out sensitive information from user queries before they are sent to the potentially untrustworthy, though highly capable, cloud-based LLM. Through experiments with human subjects, we demonstrate that this dual-model approach introduces minimal overhead while significantly enhancing user privacy, without compromising the quality of LLM responses.
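The filter-then-forward flow of a gatekeeper can be sketched as follows. This is not the authors' method: the abstract describes a lightweight, locally run model, whereas this sketch substitutes a few simple regular expressions purely to show where the filtering step sits. The pattern set, the placeholder labels, and the `gatekeep` and `ask_cloud_llm` names are all hypothetical, and the patterns cover only a few US-style formats.

```python
import re

# Hypothetical stand-in for the local gatekeeper model: a few regex patterns.
# A real gatekeeper would need far broader and more robust PII detection.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "PHONE": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def gatekeep(query):
    """Replace each matched PII span with a bracketed placeholder label."""
    redacted = query
    for label, pattern in PII_PATTERNS.items():
        redacted = pattern.sub(f"[{label}]", redacted)
    return redacted


def ask_cloud_llm(query, send):
    """Filter the query locally, then pass only the redacted text to `send`.

    `send` is a caller-supplied function standing in for the cloud LLM client.
    """
    return send(gatekeep(query))


# Usage with an echo function in place of a real cloud client.
print(ask_cloud_llm("Email me at jane.doe@example.com or call 555-123-4567.", lambda q: q))
```

The point of the design is that the raw query never leaves the local machine; only the output of `gatekeep` is handed to the remote service.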
CR Sep 19, 2019
Kinetic Song Comprehension: Deciphering Personal Listening Habits via Phone Vibrations
Richard Matovu, Isaac Griswold-Steiner, Abdul Serwadda
Music is an expression of our identity, showing a significant correlation with other personal traits, beliefs, and habits. If accessed by a malicious entity, an individual's music listening habits could be used to make critical inferences about the user. In this paper, we showcase an attack in which the vibrations propagated through a user's phone while playing music via its speakers can be used to detect and classify songs. Our attack shows that known songs can be detected with an accuracy of just under 80%, while a corpus of 100 songs can be classified with an accuracy greater than 80%. We investigate these questions under a wide variety of experimental scenarios involving three surfaces and five phone speaker volumes. Although users can mitigate some of the risk by using a phone cover to dampen the vibrations, we show that a sophisticated attacker could adapt the attack to still classify songs with decent accuracy. This paper demonstrates a new way in which motion sensor data can be leveraged to intrude on users' music preferences without their express permission. Whether this information is leveraged for financial gain or political purposes, our research makes a case for why more rigorous methods of protecting user data should be utilized by companies and, if necessary, individuals.