Boyu Yang

7.6CVJun 17, 2024Code

ClawMachine: Learning to Fetch Visual Tokens for Referential Comprehension

Tianren Ma, Lingxi Xie, Yunjie Tian et al.

Aligning vision and language concepts at a finer level remains an essential topic of multimodal large language models (MLLMs), particularly for tasks such as referring and grounding. Existing methods, such as proxy encoding and geometry encoding, incorporate additional syntax to encode spatial information, imposing extra burdens when communicating between language and vision modules. In this study, we propose ClawMachine, offering a new methodology that explicitly notates each entity using token collectives groups of visual tokens that collaboratively represent higher level semantics. A hybrid perception mechanism is also explored to perceive and understand scenes from both discrete and continuous spaces. Our method unifies the prompt and answer of visual referential tasks without using additional syntax. By leveraging a joint vision-language vocabulary, ClawMachine further integrates referring and grounding in an auto-regressive manner, demonstrating great potential with scaled-up pre-training data. Experiments show that ClawMachine achieves superior performance on scene-level and referential understanding tasks with higher efficiency. It also exhibits the potential to integrate multi-source information for complex visual reasoning, which is beyond the capability of many MLLMs. Our code is available at github.com/martian422/ClawMachine.

10.2CVMay 26, 2025

ReDDiT: Rehashing Noise for Discrete Visual Generation

Tianren Ma, Xiaosong Zhang, Boyu Yang et al.

In the visual generative area, discrete diffusion models are gaining traction for their efficiency and compatibility. However, pioneered attempts still fall behind their continuous counterparts, which we attribute to noise (absorbing state) design and sampling heuristics. In this study, we propose a rehashing noise approach for discrete diffusion transformer (termed ReDDiT), with the aim to extend absorbing states and improve expressive capacity of discrete diffusion models. ReDDiT enriches the potential paths that latent variables traverse during training with randomized multi-index corruption. The derived rehash sampler, which reverses the randomized absorbing paths, guarantees high diversity and low discrepancy of the generation process. These reformulations lead to more consistent and competitive generation quality, mitigating the need for heavily tuned randomness. Experiments show that ReDDiT significantly outperforms the baseline model (reducing gFID from 6.18 to 1.61) and is on par with the continuous counterparts.

Boyu Yang

2 Papers