CL · CY · Mar 24

GoCoMA: Hyperbolic Multimodal Representation Fusion for Large Language Model-Generated Code Attribution

Nitin Choudhury, Bikrant Bikram Pratap Maurya, Bhavinkumar Vinodbhai Kuwar, Arun Balaji Buduru

arXiv:2604.16377 · 96.2 · h-index: 1 · Has Code

Predicted impact: top 8% in CL · last 90 days

Originality: Incremental advance

AI Analysis

For forensic and security analysts, GoCoMA provides a more accurate method to distinguish between human-written and LLM-generated code, addressing practical concerns like security vulnerabilities and licensing ambiguity.

GoCoMA introduces a hyperbolic multimodal fusion framework for attributing code to its generating LLM, combining code stylometry and binary pre-executable artifact images. It achieves state-of-the-art results on two benchmarks, outperforming unimodal and Euclidean multimodal baselines.

Large Language Models (LLMs) trained on massive code corpora are now increasingly capable of generating code that is hard to distinguish from human-written code. This raises practical concerns, including security vulnerabilities and licensing ambiguity, and also motivates a forensic question: 'Who (or which LLM) wrote this piece of code?' We present GoCoMA, a multimodal framework that models an extrinsic hierarchy between (i) code stylometry, capturing higher-level structural and stylistic signatures, and (ii) image representations of binary pre-executable artifacts (BPEA), capturing lower-level, execution-oriented byte semantics shaped by compilation and toolchains. GoCoMA projects modality embeddings into a hyperbolic Poincaré ball, fuses them via a geodesic-cosine similarity-based cross-modal attention (GCSA) fusion mechanism, and back-projects the fused representation to Euclidean space for final LLM-source attribution. Experiments on two open-source benchmarks (CoDET-M4 and LLMAuthorBench) show that GoCoMA consistently outperforms unimodal and Euclidean multimodal baselines under identical evaluation protocols.
