Tokenization of Gaze Data
This work addresses the problem of integrating gaze data into multimodal models, which is relevant to researchers in human-computer interaction and vision. The contribution is incremental, however, as it adapts existing tokenization methods to a new data type.
The paper tackles the lack of tokenization strategies for gaze data by analyzing five tokenizers on three datasets for forecasting and generation with LLMs. It finds that a quantile tokenizer performs best for predicting gaze positions and a k-means tokenizer for predicting gaze velocities.
A considerable part of the performance of today's large language models (LLMs) and multimodal large language models (MLLMs) depends on their tokenization strategies. While tokenizers are extensively researched for textual and visual input, there is no research on tokenization strategies for gaze data, owing to its distinct nature. However, a corresponding tokenization strategy would allow using the vision capabilities of pre-trained MLLMs for gaze data, for example, through fine-tuning. In this paper, we aim to close this research gap by analyzing five different tokenizers for gaze data on three different datasets for the forecasting and generation of gaze data through LLMs (cf.~\cref{fig:teaser}). We evaluate the tokenizers regarding their reconstruction and compression abilities. Further, we train an LLM for each tokenization strategy and measure its generative and predictive performance. Overall, we found that a quantile tokenizer outperforms all others in predicting gaze positions and that a k-means tokenizer is best when predicting gaze velocities.
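To make the idea of a quantile tokenizer concrete, the following is a minimal sketch, not the implementation evaluated in the paper. It assumes a single gaze coordinate given as a one-dimensional NumPy array, places bin edges at evenly spaced quantiles so that each token is used roughly equally often, and decodes a token to the midpoint of its bin. The function names (`fit_quantile_bins`, `encode`, `decode`) and the vocabulary size of 16 are hypothetical choices made for illustration.

```python
import numpy as np


def fit_quantile_bins(values, n_bins):
    """Return n_bins + 1 bin edges placed at evenly spaced quantiles."""
    qs = np.linspace(0.0, 1.0, n_bins + 1)
    return np.quantile(values, qs)


def encode(values, edges):
    """Map each value to a token id in the range 0 .. n_bins - 1."""
    # Only the interior edges are passed to np.digitize, so values at or
    # beyond the outermost edges fall into the first or last bin.
    return np.digitize(values, edges[1:-1])


def decode(ids, edges):
    """Map token ids back to the midpoint of their bin (lossy)."""
    centers = 0.5 * (edges[:-1] + edges[1:])
    return centers[ids]


if __name__ == "__main__":
    # Synthetic stand-in for one gaze coordinate (assumption, not real data).
    rng = np.random.default_rng(0)
    gaze_x = rng.normal(size=1000)

    edges = fit_quantile_bins(gaze_x, n_bins=16)
    tokens = encode(gaze_x, edges)
    reconstructed = decode(tokens, edges)

    # Mean absolute reconstruction error of the round trip.
    print(np.mean(np.abs(reconstructed - gaze_x)))
```

The round trip through `encode` and `decode` gives a simple reconstruction error, and the number of bins controls the trade-off between vocabulary size and precision. A k-means tokenizer would follow the same encode/decode pattern, with learned cluster centers in place of the bin midpoints.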