63.5CLMay 25
Triplet-Block Diffusion RWKVKe Lin, Yiyang Luo, Zhaolong Su et al.
Causal Transformer language models suffer from strictly sequential decoding and a quadratic per-step attention cost. While linear-time causal models and discrete diffusion models each address these weaknesses, their integration remains inherently inconsistent: diffusion requires bidirectional attention, while causal models are unidirectional. To unify these architectures, we propose $B^3D-RWKV$, a diffusion RWKV variant that integrates the model's $O(L)$ inference efficiency with parallel, bidirectional discrete-diffusion through a \emph{triplet-block layout} method. $B^3D-RWKV-7.2B$ reaches comparable accuracy on an 8-task suite versus existing models while significantly outperforming baselines in decoding throughput with an average of $\mathbf{1.6\times}$ speedup.
CVSep 2, 2024
ViRED: Prediction of Visual Relations in Engineering DrawingsChao Gu, Ke Lin, Yiyang Luo et al.
To accurately understand engineering drawings, it is essential to establish the correspondence between images and their description tables within the drawings. Existing document understanding methods predominantly focus on text as the main modality, which is not suitable for documents containing substantial image information. In the field of visual relation detection, the structure of the task inherently limits its capacity to assess relationships among all entity pairs in the drawings. To address this issue, we propose a vision-based relation detection model, named ViRED, to identify the associations between tables and circuits in electrical engineering drawings. Our model mainly consists of three parts: a vision encoder, an object encoder, and a relation decoder. We implement ViRED using PyTorch to evaluate its performance. To validate the efficacy of ViRED, we conduct a series of experiments. The experimental results indicate that, within the engineering drawing dataset, our approach attained an accuracy of 96\% in the task of relation prediction, marking a substantial improvement over existing methodologies. The results also show that ViRED can inference at a fast speed even when there are numerous objects in a single engineering drawing.
CVNov 26, 2023
Context-Aware Indoor Point Cloud Object Generation through User InstructionsYiyang Luo, Ke Lin, Chao Gu
Indoor scene modification has emerged as a prominent area within computer vision, particularly for its applications in Augmented Reality (AR) and Virtual Reality (VR). Traditional methods often rely on pre-existing object databases and predetermined object positions, limiting their flexibility and adaptability to new scenarios. In response to this challenge, we present a novel end-to-end multi-modal deep neural network capable of generating point cloud objects seamlessly integrated with their surroundings, driven by textual instructions. Our model revolutionizes scene modification by enabling the creation of new environments with previously unseen object layouts, eliminating the need for pre-stored CAD models. Leveraging Point-E as our generative model, we introduce innovative techniques such as quantized position prediction and Top-K estimation to address the issue of false negatives resulting from ambiguous language descriptions. Furthermore, we conduct comprehensive evaluations to showcase the diversity of generated objects, the efficacy of textual instructions, and the quantitative metrics, affirming the realism and versatility of our model in generating indoor objects. To provide a holistic assessment, we incorporate visual grounding as an additional metric, ensuring the quality and coherence of the scenes produced by our model. Through these advancements, our approach not only advances the state-of-the-art in indoor scene modification but also lays the foundation for future innovations in immersive computing and digital environment creation.
CVDec 12, 2024
Geo-LLaVA: A Large Multi-Modal Model for Solving Geometry Math Problems with Meta In-Context LearningShihao Xu, Yiyang Luo, Wei Shi
Geometry mathematics problems pose significant challenges for large language models (LLMs) because they involve visual elements and spatial reasoning. Current methods primarily rely on symbolic character awareness to address these problems. Considering geometry problem solving is a relatively nascent field with limited suitable datasets and currently almost no work on solid geometry problem solving, we collect a geometry question-answer dataset by sourcing geometric data from Chinese high school education websites, referred to as GeoMath. It contains solid geometry questions and answers with accurate reasoning steps as compensation for existing plane geometry datasets. Additionally, we propose a Large Multi-modal Model (LMM) framework named Geo-LLaVA, which incorporates retrieval augmentation with supervised fine-tuning (SFT) in the training stage, called meta-training, and employs in-context learning (ICL) during inference to improve performance. Our fine-tuned model with ICL attains the state-of-the-art performance of 65.25% and 42.36% on selected questions of the GeoQA dataset and GeoMath dataset respectively with proper inference steps. Notably, our model initially endows the ability to solve solid geometry problems and supports the generation of reasonable solid geometry picture descriptions and problem-solving steps. Our research sets the stage for further exploration of LLMs in multi-modal math problem-solving, particularly in geometry math problems.
CLMar 16, 2024
Zero-shot Generative Linguistic SteganographyKe Lin, Yiyang Luo, Zijian Zhang et al.
Generative linguistic steganography attempts to hide secret messages into covertext. Previous studies have generally focused on the statistical differences between the covertext and stegotext, however, ill-formed stegotext can readily be identified by humans. In this paper, we propose a novel zero-shot approach based on in-context learning for linguistic steganography to achieve better perceptual and statistical imperceptibility. We also design several new metrics and reproducible language evaluations to measure the imperceptibility of the stegotext. Our experimental results indicate that our method produces $1.926\times$ more innocent and intelligible stegotext than any other method.
CLMar 15, 2024
Lost in Overlap: Exploring Logit-based Watermark Collision in LLMsYiyang Luo, Ke Lin, Chao Gu et al.
The proliferation of large language models (LLMs) in generating content raises concerns about text copyright. Watermarking methods, particularly logit-based approaches, embed imperceptible identifiers into text to address these challenges. However, the widespread usage of watermarking across diverse LLMs has led to an inevitable issue known as watermark collision during common tasks, such as paraphrasing or translation. In this paper, we introduce watermark collision as a novel and general philosophy for watermark attacks, aimed at enhancing attack performance on top of any other attacking methods. We also provide a comprehensive demonstration that watermark collision poses a threat to all logit-based watermark algorithms, impacting not only specific attack scenarios but also downstream applications.