Luca Virgili

h-index18

4papers

837citations

4 Papers

16.2CLJul 16

Latent Trajectory Discrimination for AI-Generated Text Detection

Gianluca Bonifazi, Christopher Buratti, Michele Marchetti et al.

Most existing approaches to AI-Generated Text Detection (AIGTD) treat documents as static objects and base their decisions on aggregate statistics or globally compressed embeddings. However, this perspective overlooks the inherently dynamic nature of autoregressive generation, where content evolves progressively through the latent space. In this paper, we reformulate AIGTD as the problem of distinguishing between latent generation trajectories. Instead of relying on static representations, we model how textual representations evolve across the sequence. To this end, we propose Geometric Trajectory and Contrastive Learning (GTCL), a framework that segments the document into ordered local units, encodes each unit in an embedding space, and constructs a structured and sequence-level representation. GTCL then applies contrastive learning to these trajectories to learn geometric regularities associated with the autoregressive generation. Evaluations performed on three different benchmarks and several approaches show that GTCL outperforms detection baselines consistently, which implies that explicitly modeling sequential dynamics provides robust discriminative signals across models and domains. These results suggest that modeling trajectory differences could improve detection and open up a dynamic direction that has been underexplored in previous AIGTD literature.

20.9CLJun 29

Efficient Retrieval-Augmented Generation via Token Co-occurrence Graphs

Gianluca Bonifazi, Christopher Buratti, Michele Marchetti et al.

Retrieval-Augmented Generation (RAG) mitigates hallucinations in Large Language Models (LLMs) by grounding the generation process on external knowledge. However, standard RAG approaches struggle with multi-hop reasoning. While recent graph-based RAG methods improve the retrieval of interconnected chunks, they often rely on computationally expensive and error-prone LLM-based extraction pipelines. To address these issues, we propose TIGRAG (Token-Induced GraphRAG), an efficient graph-augmented RAG framework based on a token co-occurrence Knowledge Graph. TIGRAG directly models topological relationships between tokens using sliding-window co-occurrence statistics, thus enabling scalable graph construction. During inference, it combines graph-based semantic expansion and neural reranking to retrieve interconnected evidence for multi-hop reasoning. Specifically, it introduces an iterative entity-driven retrieval strategy that progressively expands the query using bridging entities extracted from previously retrieved contexts. We evaluated TIGRAG on three widely adopted multi-hop Question Answering (QA) benchmarks. Experimental results demonstrated that our framework consistently outperforms dense retrieval and graph-based RAG methods in both retrieval and downstream QA tasks, while substantially reducing indexing time, inference latency, and prompt footprint.

1.2SIDec 3, 2024

Characterizing Information Shared by Participants to Coding Challenges: The Case of Advent of Code

Francesco Cauteruccio, Enrico Corradini, Luca Virgili

Advent of Code (AoC from now on) is a popular coding challenge requiring to solve programming puzzles for a variety of skill sets and levels. AoC follows the advent calendar, therefore it is an annual challenge that lasts for 25 days. AoC participants usually post their solutions on social networks and discuss them online. These challenges are interesting to study since they could highlight the adoption of new tools, the evolution of the developer community, or the technological requirements of well-known companies. For these reasons, we first create a dataset of the 2019-2021 AoC editions containing the discussion threads made on the subreddit {\tt /r/adventofcode}. Then, we propose a model based on stream graphs to best study this context, where we represent its most important actors through time: participants, comments, and programming languages. Thanks to our model, we investigate user participation, adoption of new programming languages during a challenge and between two of them, and resiliency of programming languages based on a Stack Overflow survey. We find that the top-used programming languages are almost the same in the three years, pointing out their importance. Moreover, participants tend to keep the same programming language for the whole challenge, while the ones attending two AoCs usually change it in the next one. Finally, we observe interesting results about the programming languages that are ``Popular'' or ``Loved'' according to the Stack Overflow survey. Firstly, these are the ones adopted for the longest time in an AoC edition, thanks to which users have a high chance of reaching the end of the challenge. Secondly, they are the most chosen when a participant decides to change programming language during the same challenge.

2.3SIMay 17, 2023Code

Predicting Tweet Engagement with Graph Neural Networks

Marco Arazzi, Marco Cotogni, Antonino Nocera et al.

Social Networks represent one of the most important online sources to share content across a world-scale audience. In this context, predicting whether a post will have any impact in terms of engagement is of crucial importance to drive the profitable exploitation of these media. In the literature, several studies address this issue by leveraging direct features of the posts, typically related to the textual content and the user publishing it. In this paper, we argue that the rise of engagement is also related to another key component, which is the semantic connection among posts published by users in social media. Hence, we propose TweetGage, a Graph Neural Network solution to predict the user engagement based on a novel graph-based model that represents the relationships among posts. To validate our proposal, we focus on the Twitter platform and perform a thorough experimental campaign providing evidence of its quality.