PixT3: Pixel-based Table-To-Text Generation
This paper addresses information loss and inefficiency in table-to-text generation for NLP applications, though its contribution is incremental because it builds on existing multimodal approaches.
The paper recasts table-to-text generation as a visual recognition task, which avoids the problems of linearization, and shows that PixT3 is competitive with, and in some settings superior to, text-based models on the ToTTo and Logic2Text benchmarks.
Table-to-text generation involves generating appropriate textual descriptions given structured tabular data. It has attracted increasing attention in recent years thanks to the popularity of neural network models and the availability of large-scale datasets. A common feature across existing methods is their treatment of the input as a string, i.e., by employing linearization techniques that do not always preserve information in the table, are verbose, and lack space efficiency. We propose to rethink data-to-text generation as a visual recognition task, removing the need for rendering the input in a string format. We present PixT3, a multimodal table-to-text model that overcomes the challenges of linearization and input size limitations encountered by existing models. PixT3 is trained with a new self-supervised learning objective to reinforce table structure awareness and is applicable to open-ended and controlled generation settings. Experiments on the ToTTo and Logic2Text benchmarks show that PixT3 is competitive and, in some settings, superior to generators that operate solely on text.
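To make the linearization problem concrete, here is a minimal sketch of what flattening a table into a string can look like. The tag scheme (`<table>`, `<row>`, `<cell>`), the helper names `linearize` and `markup_overhead`, and the sample table are all illustrative assumptions; they are not the format used by ToTTo, Logic2Text, or the paper's baselines.

```python
from typing import List


def linearize(table: List[List[str]]) -> str:
    """Flatten a table row by row into one tagged string.

    The tag scheme is a hypothetical example, not a benchmark format.
    """
    parts = ["<table>"]
    for row in table:
        parts.append("<row>")
        for cell in row:
            parts.append(f"<cell> {cell} </cell>")
        parts.append("</row>")
    parts.append("</table>")
    return " ".join(parts)


def markup_overhead(table: List[List[str]]) -> float:
    """Fraction of whitespace-separated tokens that are structural tags."""
    tokens = linearize(table).split()
    tag_tokens = sum(1 for t in tokens if t.startswith("<") and t.endswith(">"))
    return tag_tokens / len(tokens)


# Made-up sample data, used only to illustrate the effect.
table = [
    ["Year", "Title", "Role"],
    ["2015", "Example Film", "Lead"],
    ["2017", "Another Film", "Support"],
]

print(linearize(table))
print(f"tag tokens: {markup_overhead(table):.0%} of the sequence")
```

In this toy case most of the sequence is markup rather than cell content, and the column a cell belongs to can only be recovered by counting its position within a row. This is the kind of verbosity and structural loss that motivates treating the table as an image instead.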