CL Sep 28, 2023
The Confidence-Competence Gap in Large Language Models: A Cognitive Study
Aniket Kumar Singh, Suman Devkota, Bishal Lamichhane et al.
Large Language Models (LLMs) have attracted widespread attention for their performance across diverse domains. This study examines LLMs' cognitive abilities and confidence dynamics, focusing on the alignment between their self-assessed confidence and their actual performance. We probe these models with diverse sets of questionnaires and real-world scenarios and examine how they express confidence in their responses. Our findings reveal instances where models demonstrate high confidence even when they answer incorrectly, which is reminiscent of the Dunning-Kruger effect observed in human psychology. In contrast, there are cases where models exhibit low confidence with correct answers, revealing potential underestimation biases. Our results underscore the need for a deeper understanding of their cognitive processes. By examining the nuances of LLMs' self-assessment mechanisms, this investigation offers insights that can help improve these models and broaden their potential applications.
AI Feb 15, 2024
GPT-4's assessment of its performance in a USMLE-based case study
Uttam Dhakal, Aniket Kumar Singh, Suman Devkota et al.
This study investigates GPT-4's assessment of its performance in healthcare applications. Using a simple prompting technique, the LLM was given questions taken from the United States Medical Licensing Examination (USMLE) and asked to report its confidence both before and after each question was posed. The questions were divided into two groups: questions with feedback (WF) and questions with no feedback (NF) after the question. The model was asked to provide absolute and relative confidence scores before and after each question. The experimental findings were analyzed using statistical tools to study the variability of confidence in the WF and NF groups. Additionally, a sequential analysis was conducted to observe the performance variation for the WF and NF groups. Results indicate that feedback influences relative confidence but does not consistently increase or decrease it. Understanding the performance of LLMs is paramount in exploring their utility in sensitive areas like healthcare. This study contributes to the ongoing discourse on the reliability of AI, particularly of LLMs like GPT-4, within healthcare, offering insights into how feedback mechanisms might be optimized to enhance AI-assisted medical education and decision support.
CY Apr 21, 2021
Public Perception of the German COVID-19 Contact-Tracing App Corona-Warn-App
Felix Beierle, Uttam Dhakal, Caroline Cohrdes et al.
Several governments introduced or promoted the use of contact-tracing apps during the ongoing COVID-19 pandemic. In Germany, the related app is called Corona-Warn-App, and by the end of 2020, it had 22.8 million downloads. Contact tracing is a promising approach for containing the spread of the novel coronavirus. It is only effective if there is a large user base, which brings new challenges such as app users who are unfamiliar with smartphones or apps. As use of Corona-Warn-App is voluntary, reaching many users and gaining a positive public perception is crucial for its effectiveness. Based on app reviews and tweets, we analyze the public perception of Corona-Warn-App. We collected and analyzed all 78,963 app reviews for the Android and iOS versions from release (June 2020) to the beginning of February 2021, as well as all original tweets until February 2021 containing #CoronaWarnApp (43,082). For the reviews, the most common words and n-grams point towards technical issues, but it remains unclear to what extent this is due to the app itself, the underlying Exposure Notification Framework, system settings on the user's phone, or the user's misinterpretations of app content. For Twitter data, overall, based on tweet content, frequent hashtags, and interactions with tweets, we conclude that the German Twitter-sphere widely reports adopting the app and promotes its use.