BMJun 6, 2023Code
MolFM: A Multimodal Molecular Foundation ModelYizhen Luo, Kai Yang, Massimo Hong et al.
Molecular knowledge resides within three different modalities of information sources: molecular structures, biomedical documents, and knowledge bases. Effective incorporation of molecular knowledge from these modalities holds paramount significance in facilitating biomedical research. However, existing multimodal molecular foundation models exhibit limitations in capturing intricate connections between molecular structures and texts, and more importantly, none of them attempt to leverage a wealth of molecular expertise derived from knowledge graphs. In this study, we introduce MolFM, a multimodal molecular foundation model designed to facilitate joint representation learning from molecular structures, biomedical texts, and knowledge graphs. We propose cross-modal attention between atoms of molecular structures, neighbors of molecule entities and semantically related texts to facilitate cross-modal comprehension. We provide theoretical analysis that our cross-modal pre-training captures local and global molecular knowledge by minimizing the distance in the feature space between different modalities of the same molecule, as well as molecules sharing similar structures or functions. MolFM achieves state-of-the-art performance on various downstream tasks. On cross-modal retrieval, MolFM outperforms existing models with 12.13% and 5.04% absolute gains under the zero-shot and fine-tuning settings, respectively. Furthermore, qualitative analysis showcases MolFM's implicit ability to provide grounding from molecular substructures and knowledge graphs. Code and models are available on https://github.com/BioFM/OpenBioMed.
CLFeb 26, 2023Code
Efficient Ensemble for Multimodal Punctuation Restoration using Time-Delay Neural NetworkXing Yi Liu, Homayoon Beigi
Punctuation restoration plays an essential role in the post-processing procedure of automatic speech recognition, but model efficiency is a key requirement for this task. To that end, we present EfficientPunct, an ensemble method with a multimodal time-delay neural network that outperforms the current best model by 1.0 F1 points, using less than a tenth of its inference network parameters. We streamline a speech recognizer to efficiently output hidden layer acoustic embeddings for punctuation restoration, as well as BERT to extract meaningful text embeddings. By using forced alignment and temporal convolutions, we eliminate the need for attention-based fusion, greatly increasing computational efficiency and raising performance. EfficientPunct sets a new state of the art with an ensemble that weights BERT's purely language-based predictions slightly more than the multimodal network's predictions. Our code is available at https://github.com/lxy-peter/EfficientPunct.
CLSep 17, 2024Code
Spontaneous Informal Speech Dataset for Punctuation RestorationXing Yi Liu, Homayoon Beigi
Presently, punctuation restoration models are evaluated almost solely on well-structured, scripted corpora. On the other hand, real-world ASR systems and post-processing pipelines typically apply towards spontaneous speech with significant irregularities, stutters, and deviations from perfect grammar. To address this discrepancy, we introduce SponSpeech, a punctuation restoration dataset derived from informal speech sources, which includes punctuation and casing information. In addition to publicly releasing the dataset, we contribute a filtering pipeline that can be used to generate more data. Our filtering pipeline examines the quality of both speech audio and transcription text. We also carefully construct a ``challenging" test set, aimed at evaluating models' ability to leverage audio information to predict otherwise grammatically ambiguous punctuation. SponSpeech is available at https://github.com/GitHubAccountAnonymous/PR, along with all code for dataset building and model runs.
LGApr 17, 2023
Towards Unified AI Drug Discovery with Multiple Knowledge ModalitiesYizhen Luo, Xing Yi Liu, Kai Yang et al.
In recent years, AI models that mine intrinsic patterns from molecular structures and protein sequences have shown promise in accelerating drug discovery. However, these methods partly lag behind real-world pharmaceutical approaches of human experts that additionally grasp structured knowledge from knowledge bases and unstructured knowledge from biomedical literature. To bridge this gap, we propose KEDD, a unified, end-to-end, and multimodal deep learning framework that optimally incorporates both structured and unstructured knowledge for vast AI drug discovery tasks. The framework first extracts underlying characteristics from heterogeneous inputs, and then applies multimodal fusion for accurate prediction. To mitigate the problem of missing modalities, we leverage multi-head sparse attention and a modality masking mechanism to extract relevant information robustly. Benefiting from integrated knowledge, our framework achieves a deeper understanding of molecule entities, brings significant improvements over state-of-the-art methods on a wide range of tasks and benchmarks, and reveals its promising potential in assisting real-world drug discovery.
LGJun 14, 2024Code
Learning Multi-view Molecular Representations with Structured and Unstructured KnowledgeYizhen Luo, Kai Yang, Massimo Hong et al.
Capturing molecular knowledge with representation learning approaches holds significant potential in vast scientific fields such as chemistry and life science. An effective and generalizable molecular representation is expected to capture the consensus and complementary molecular expertise from diverse views and perspectives. However, existing works fall short in learning multi-view molecular representations, due to challenges in explicitly incorporating view information and handling molecular knowledge from heterogeneous sources. To address these issues, we present MV-Mol, a molecular representation learning model that harvests multi-view molecular expertise from chemical structures, unstructured knowledge from biomedical texts, and structured knowledge from knowledge graphs. We utilize text prompts to model view information and design a fusion architecture to extract view-based molecular representations. We develop a two-stage pre-training procedure, exploiting heterogeneous data of varying quality and quantity. Through extensive experiments, we show that MV-Mol provides improved representations that substantially benefit molecular property prediction. Additionally, MV-Mol exhibits state-of-the-art performance in multi-modal comprehension of molecular structures and texts. Code and data are available at https://github.com/PharMolix/OpenBioMed.