Yetao Wu

CL
4papers
13citations
Novelty54%
AI Score27

4 Papers

LGSep 9, 2024
M3-JEPA: Multimodal Alignment via Multi-gate MoE based on the Joint-Embedding Predictive Architecture

Hongyang Lei, Xiaolong Cheng, Qi Qin et al.

Current multimodal learning strategies primarily optimize in the original token space. Such a framework is easy to incorporate with the backbone of pretrained language model, but might result in modality collapse. To alleviate such issues, we leverage the Joint-Embedding Predictive Architecture (JEPA) on the multimodal tasks, which converts the input embedding into the output embedding space by a predictor and then conducts the cross-modal alignment on the latent space. We implement this predictor by a Multi-Gate Mixture of Experts (MMoE) and name the framework as M3-JEPA, accordingly. The gating function disentangles the modality-specific and shared information and derives information-theoretic optimality. The framework is implemented with both contrastive and regularization loss, and solved by alternative gradient descent (AGD) between different multimodal tasks. By thoroughly designed experiments, we show that M3-JEPA can obtain state-of-the-art performance on different modalities and tasks, generalize to unseen datasets and domains, and is computationally efficient in both training and inference. Our observation suggests that M3-JEPA might become a new basis to self-supervised learning in the open world.

CLSep 10, 2024
LaMsS: When Large Language Models Meet Self-Skepticism

Yetao Wu, Yihong Wang, Teng Chen et al.

Hallucination is a major challenge for large language models (LLMs), preventing their further application in some fields. The skeptical thinking of humankind could be useful for LLMs to self-cognition, self-reflection and alleviate their hallucinations. Inspired by this consideration, we propose a novel approach called LaMsS, which combines the semantic understanding capability of LLMs with self-skepticism. By introducing a series of skepticism tokens and augmenting them into the vocabulary, we conduct both pertaining and finetuning, which allow the LLM to decode each normal token followed by a skeptical token, representing different skepticism levels. By calculating the response skepticism given a query, one can define a new self-aware LLM which is only willing to answer with relative lower skepticism level than the threshold. By examining the accuracy, AUC and AP of willingly answering questions, we demonstrate that LaMsS achieves better performance than baselines on both multi-choice questions and open-domain question-answering benchmarks, and can generalize to multi-task and out-of-domain settings. Our study sheds some lights on the self-skepticism modeling on further artificial intelligence. Project code and model checkpoints can be found in https://anonymous.4open.science/r/SM-1E76.

CLSep 10, 2024
A Practice of Post-Training on Llama-3 70B with Optimal Selection of Additional Language Mixture Ratio

Ningyuan Xi, Yetao Wu, Kun Fan et al.

Large Language Models (LLM) often need to be Continual Pre-Trained (CPT) to obtain unfamiliar language skills or adapt to new domains. The huge training cost of CPT often asks for cautious choice of key hyper-parameters such as the mixture ratio of extra language or domain corpus. However, there is no systematic study that bridges the gap between the optimal mixture ratio and the actual model performance, and the gap between experimental scaling law and the actual deployment in the full model size. In this paper, we perform CPT on Llama-3 8B and 70B to enhance its Chinese ability. We study the optimal correlation between the Additional Language Mixture Ratio (ALMR) and the Learning Rate (LR) on the 8B size which directly indicates the optimal experimental setup. By thorough choice of hyper-parameter, and subsequent fine-tuning, the model capability is improved not only on the Chinese-related benchmark but also in some specific domains including math, coding, and emotional intelligence. We deploy the final 70B version of LLM on a real-life chat system which obtains satisfying performance.

CLSep 18, 2024
MeTHanol: Modularized Thinking Language Models with Intermediate Layer Thinking, Decoding and Bootstrapping Reasoning

Ningyuan Xi, Xiaoyu Wang, Yetao Wu et al.

Current research efforts are focused on enhancing the thinking and reasoning capability of large language model (LLM) by prompting, data-driven emergence and inference-time computation. In this study, we consider stimulating language model's thinking and cognitive abilities from a modular perspective, which mimics the human brain architecture. We select a specific intermediate attention layer with newly implemented language heads. We conduct dual-layer fine-tuning by annotated (query, thought, answer) samples and show that the intermediate layer can also learn to decode fluent and reasonable language tokens. A two-pass inference mechanism is designed to generate thoughts then formal responses. The entire framework is called modularized thinking language model (MeTHanol) which can enhance LLM's cognitive behaviors as indicated by Theory of Mind (ToM) and Vignette-based experiments. Case studies also show that MeTHanol can plan and self-reflect and generate human-like thoughts and answers, even on unseen and open-domain tasks. MeTHanol can also adapt to a personalized prompt and behave as the specified character. Our study holds promise for significant cognitive gains from a modular perspective. Our code, model and data are available at https://bachozean.github.io/methanol-page