Sight Beyond Text: Multi-Modal Training Enhances LLMs in Truthfulness and Ethics
This work addresses the problem of improving truthfulness and ethical alignment in language models, a concern for AI safety and alignment research, though it appears incremental because it builds on existing tuning methods.
The study found that visual instruction tuning for multi-modal large language models unexpectedly improves truthfulness and ethical alignment in pure NLP tasks, with a tuned LLaMA2 7B model outperforming a chat version fine-tuned with over one million human annotations on benchmarks like TruthfulQA-mc and Ethics.
Multi-modal large language models (MLLMs) are built on large language models (LLMs) and extend them with the ability to comprehend multi-modal inputs and generate textual responses. While they excel in multi-modal tasks, the pure NLP abilities of MLLMs are often underestimated and left untested. In this study, we step outside the box and unveil an intriguing characteristic of MLLMs: our preliminary results suggest that visual instruction tuning, a prevailing strategy for transitioning LLMs into MLLMs, unexpectedly helps models attain both improved truthfulness and improved ethical alignment in the pure NLP context. For example, a visual-instruction-tuned LLaMA2 7B model surpasses the LLaMA2-chat 7B model, which was fine-tuned with over one million human annotations, on the TruthfulQA-mc and Ethics benchmarks. Further analysis reveals that the improved alignment can be attributed to the superior instruction quality inherent to visual-text data. In releasing our code at github.com/UCSC-VLAA/Sight-Beyond-Text, we aspire to foster further exploration into the intrinsic value of visual-text synergies and, in a broader scope, multi-modal interactions in alignment research.
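For readers unfamiliar with multiple-choice truthfulness evaluation, the following is a minimal sketch of how a TruthfulQA-mc style score is commonly computed from per-choice log-probabilities. It is not taken from the released code. The function names, the input layout, and the choice between the two scoring variants (often called MC1 and MC2) are assumptions made for illustration, and the abstract does not state which variant or evaluation harness was used.

```python
import math
from typing import Sequence


def mc1_correct(choice_logprobs: Sequence[float], true_index: int) -> bool:
    """MC1-style check (assumed convention): the item counts as correct when
    the single true answer receives the highest log-probability."""
    best = max(range(len(choice_logprobs)), key=lambda i: choice_logprobs[i])
    return best == true_index


def mc2_score(choice_logprobs: Sequence[float], labels: Sequence[int]) -> float:
    """MC2-style score (assumed convention): the share of probability mass,
    normalized over all listed choices, that falls on answers labeled true (1)."""
    probs = [math.exp(lp) for lp in choice_logprobs]
    true_mass = sum(p for p, y in zip(probs, labels) if y == 1)
    return true_mass / sum(probs)


def mean(values: Sequence[float]) -> float:
    """Average per-item scores into a benchmark-level number."""
    return sum(values) / len(values)


if __name__ == "__main__":
    # Hypothetical log-probabilities for one question with three choices.
    logprobs = [math.log(0.2), math.log(0.3), math.log(0.5)]
    print(mc1_correct(logprobs, true_index=2))
    print(mc2_score(logprobs, labels=[1, 0, 1]))
```

In this sketch the log-probabilities would come from scoring each candidate answer with the model under test; comparing a visual-instruction-tuned model with a chat-tuned baseline then amounts to averaging these per-item scores for each model on the same questions.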