Milan Cabarkapa

CL
h-index: 16
Papers: 4
Citations: 36
Novelty: 35%
AI Score: 23

4 Papers

CL · Dec 15, 2023
Does GPT-4 surpass human performance in linguistic pragmatics?

Ljubisa Bojic, Predrag Kovacevic, Milan Cabarkapa

As Large Language Models (LLMs) become increasingly integrated into everyday life as general-purpose multimodal AI systems, their capabilities to simulate human understanding are under examination. This study investigates LLMs' ability to interpret linguistic pragmatics, which involves context and implied meanings. Using Grice's communication principles, we evaluated both LLMs (GPT-2, GPT-3, GPT-3.5, GPT-4, and Bard) and human subjects (N = 147) on dialogue-based tasks. Human participants included 71 primarily Serbian students and 76 native English speakers from the United States. Findings revealed that LLMs, particularly GPT-4, outperformed humans. GPT-4 achieved the highest score of 4.80, surpassing the best human score of 4.55. Other LLMs performed well: GPT-3.5 scored 4.10, Bard 3.75, and GPT-3 3.25. GPT-2 had the lowest score of 1.05. The average LLM score was 3.39, exceeding the human cohorts' averages of 2.80 (Serbian students) and 2.34 (U.S. participants). In the ranking of all 155 subjects (including LLMs and humans), GPT-4 secured the top position, while the best human ranked second. These results highlight significant progress in LLMs' ability to simulate understanding of linguistic pragmatics. Future studies should confirm these findings with more dialogue-based tasks and diverse participants. This research has important implications for advancing general-purpose AI models in various communication-centered tasks, including potential application in humanoid robots in the future.
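The reported average LLM score can be checked against the five per-model scores quoted in the abstract. The snippet below is only a minimal sketch of that arithmetic; the variable names are illustrative and not taken from the paper, and it assumes the average is an unweighted mean over the five models.

```python
from statistics import mean

# Per-model scores exactly as quoted in the abstract.
llm_scores = {
    "GPT-2": 1.05,
    "GPT-3": 3.25,
    "GPT-3.5": 4.10,
    "GPT-4": 4.80,
    "Bard": 3.75,
}

# Assumption: the abstract's "average LLM score" is an unweighted mean.
average_llm_score = mean(llm_scores.values())
best_model = max(llm_scores, key=llm_scores.get)

print(f"Average LLM score: {average_llm_score:.2f}")
print(f"Highest-scoring model: {best_model}")
```

Under that assumption the mean matches the 3.39 stated in the abstract.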

CY · Jan 7, 2024
The Dual Impact of Virtual Reality: Examining the Addictive Potential and Therapeutic Applications of Immersive Media in the Metaverse

Ljubisa Bojic, Joerg Matthes, Agariadne Dwinggo Samala et al.

The emergence of the metaverse, envisioned as a hyperreal virtual universe enabling boundless human interaction, has the potential to revolutionize our conception of media. This transformation could alter society as we know it. This paper identifies addictive features of social media, including immersion, interactivity, real-time access, and personalization. These features are examined within the context of virtual reality through a literature review and content analysis, aimed at exploring the potential consequences of metaverse development. From an initial pool of 193,218 documents, a refined selection of N = 44 relevant papers formed the basis of our qualitative analysis. About half of the analyzed papers indicate that these features contribute to VR addiction. Interestingly, the same features that contribute to addictive behaviors can also be harnessed for positive therapeutic interventions using VR, particularly in treating addictions and managing mental health conditions. This duality, observed in the other half of the papers, emphasizes the complex role of VR technologies, suggesting that they can serve as a substitute for other addictions. This phenomenon is placed into the historical context of evolving media technologies that increasingly mimic reality. The complex interplay of factors contributing to addiction necessitates the development of algorithmic solutions that actively curate diverse offerings, rather than promoting a closed loop of like-minded views. Traditional models of addiction should be adapted to address these unique challenges. Finally, the discussion turns to the implications of these findings for a society where the metaverse is widely accepted as a mainstream technology.

CL · Jan 5, 2025
Evaluating Large Language Models Against Human Annotators in Latent Content Analysis: Sentiment, Political Leaning, Emotional Intensity, and Sarcasm

Ljubisa Bojic, Olga Zagovora, Asta Zelenkauskaite et al.

In the era of rapid digital communication, vast amounts of textual data are generated daily, demanding efficient methods for latent content analysis to extract meaningful insights. Large Language Models (LLMs) offer potential for automating this process, yet comprehensive assessments comparing their performance to human annotators across multiple dimensions are lacking. This study evaluates the reliability, consistency, and quality of seven state-of-the-art LLMs, including variants of OpenAI's GPT-4, Gemini, Llama, and Mixtral, relative to human annotators in analyzing sentiment, political leaning, emotional intensity, and sarcasm detection. A total of 33 human annotators and eight LLM variants assessed 100 curated textual items, generating 3,300 human and 19,200 LLM annotations, with LLMs evaluated across three time points to examine temporal consistency. Inter-rater reliability was measured using Krippendorff's alpha, and intra-class correlation coefficients assessed consistency over time. The results reveal that both humans and LLMs exhibit high reliability in sentiment analysis and political leaning assessments, with LLMs demonstrating higher internal consistency than humans. In emotional intensity, LLMs displayed higher agreement compared to humans, though humans rated emotional intensity significantly higher. Both groups struggled with sarcasm detection, evidenced by low agreement. LLMs showed excellent temporal consistency across all dimensions, indicating stable performance over time. This research concludes that LLMs, especially GPT-4, can effectively replicate human analysis in sentiment and political leaning, although human expertise remains essential for emotional intensity interpretation. The findings demonstrate the potential of LLMs for consistent and high-quality performance in certain areas of latent content analysis.
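For readers unfamiliar with Krippendorff's alpha, the following is a rough sketch of how the coefficient can be computed for nominal labels. The abstract does not state which measurement level the authors used, so the nominal distance, the function name `krippendorff_alpha_nominal`, and the toy labels are all assumptions made for illustration, not the paper's procedure.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal data.

    units: list of lists; each inner list holds the labels that different
    raters assigned to one item (missing ratings are simply omitted).
    """
    # Coincidence matrix: each ordered pair of labels from two different
    # raters on the same item contributes 1 / (m - 1), m = ratings on the item.
    coincidences = Counter()
    for labels in units:
        m = len(labels)
        if m < 2:
            continue  # a single rating carries no agreement information
        for a, b in permutations(labels, 2):
            coincidences[(a, b)] += 1.0 / (m - 1)

    # Marginal totals per label and overall number of pairable ratings.
    totals = Counter()
    for (a, _b), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())

    # Nominal distance: 0 if labels match, 1 otherwise.
    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    expected = sum(totals[a] * totals[b] for a in totals for b in totals if a != b)
    if expected == 0:
        return float("nan")  # only one label in use: alpha is undefined
    return 1.0 - (n - 1) * observed / expected


# Hypothetical toy data: two raters labelling four items.
toy_units = [["a", "a"], ["a", "b"], ["b", "b"], ["b", "b"]]
alpha = krippendorff_alpha_nominal(toy_units)
print(f"alpha = {alpha:.3f}")
```

Values near 1 indicate strong agreement and values near 0 indicate agreement no better than chance. Ordinal or interval ratings, such as intensity scales, would need a different distance function than the one sketched here.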

CY · Jan 5, 2025
Towards New Benchmark for AI Alignment & Sentiment Analysis in Socially Important Issues: A Comparative Study of Human and LLMs in the Context of AGI

Ljubisa Bojic, Dylan Seychell, Milan Cabarkapa

As general-purpose artificial intelligence systems become increasingly integrated into society and are used for information seeking, content generation, problem solving, textual analysis, coding, and running processes, it is crucial to assess their long-term impact on humans. This research explores the sentiment of large language models (LLMs) and humans toward artificial general intelligence (AGI) using a Likert-scale survey. Seven LLMs, including GPT-4 and Bard, were analyzed and compared with sentiment data from three independent human sample populations. Temporal variations in sentiment were also evaluated over three consecutive days. The results show a diversity in sentiment scores among LLMs, ranging from 3.32 to 4.12 out of 5. GPT-4 recorded the most positive sentiment toward AGI, while Bard leaned toward a neutral sentiment. In contrast, the human samples showed a lower average sentiment of 2.97. The analysis outlines potential conflicts of interest and biases in the sentiment formation of LLMs, and indicates that LLMs could subtly influence societal perceptions. To address the need for regulatory oversight and culturally grounded assessments of AI systems, we introduce the Societal AI Alignment and Sentiment Benchmark (SAAS-AI), which leverages multidimensional prompts and empirically validated societal value frameworks to evaluate language model outputs across temporal, model, and multilingual axes. This benchmark is designed to guide policymakers and AI agencies, including within frameworks such as the EU AI Act, by providing robust, actionable insights into AI alignment with human values, public sentiment, and ethical norms at both national and international levels. Future research should further refine the operationalization of the SAAS-AI benchmark and systematically evaluate its effectiveness through comprehensive empirical testing.