LLM-EvRep: Learning an LLM-Compatible Event Representation Using a Self-Supervised Framework
This work addresses the challenge of efficiently processing event-driven visual content for AI systems. Its contribution is incremental, since it builds on existing event recognition methods and LLM capabilities.
The paper tackles the problem of adapting large language models (LLMs) to event-based visual recognition. It proposes LLM-EvGen, a generator that produces LLM-compatible event representations (LLM-EvRep). When evaluated with GPT-4o, these representations improve recognition performance over the event-to-video method E2VID by 15.93% on N-ImageNet, 0.82% on N-Caltech101, and 50.21% on N-MNIST.
Recent advances in event-based recognition have demonstrated significant promise, yet most existing approaches rely on extensive training, which limits their adaptability for efficient processing of event-driven visual content. Meanwhile, large language models (LLMs) have exhibited remarkable zero-shot capabilities across diverse domains, but their application to event-based visual recognition remains largely unexplored. To bridge this gap, we propose \textbf{LLM-EvGen}, an event representation generator that produces LLM-compatible event representations, \textbf{LLM-EvRep}, thereby enhancing the performance of LLMs on event recognition tasks. The generator is trained using a self-supervised framework that encourages semantic consistency and structural fidelity in the generated representations. Comprehensive experiments were conducted on three datasets: N-ImageNet, N-Caltech101, and N-MNIST. The results demonstrate that \textbf{LLM-EvRep} outperforms the event-to-video method, E2VID, by 15.93\%, 0.82\%, and 50.21\%, respectively, in recognition tasks when evaluated using GPT-4o.