Jiangfan Zhang

85.7AIMay 15

TTE-Flash: Accelerating Reasoning-based Multimodal Representations via Think-Then-Embed Tokens

Jianpeng Cheng, Xian Wu, Jiangfan Zhang et al.

Recent research has demonstrated that Universal Multimodal Embedding (UME) benefits significantly from Chain-of-Thought (CoT) reasoning. In this paradigm, a generative model produces explicit reasoning traces for a multimodal query, with the final representation extracted from an <eos> embedding token attending to both the query and the reasoning. Despite its effectiveness, the computational overhead of generating explicit CoT traces is often prohibitive. In this work, we propose replacing explicit CoT with latent think tokens, which are interpreted as latent variables that can produce explicit CoT traces as observed variables. By optimizing think tokens using CoT generation loss and subsequent embedding tokens using contrastive loss, we produce high-performance, reasoning-aware representations at a constant inference cost. Our study investigates two key architectural designs: 1) how think and embeddings tokens should be extracted from the same LLM backbone. 2) how the tokens should be trained as two dependent tasks. We introduce TTE-Flash-2B, a reasoning-aware multimodal representation model that outperforms its explicit-CoT counterpart on the MMEB-v2 benchmark, while producing latent think tokens that are interpretable both textually and visually. Furthermore, zero-shot evaluation across 15 video datasets reveals scaling behavior as the number of think tokens increases, and motivating a pilot study of adaptive think budget allocation based on task requirements.

ITMay 24, 2016

Functional Forms of Optimum Spoofing Attacks for Vector Parameter Estimation in Quantized Sensor Networks

Jiangfan Zhang, Rick S. Blum, Lance Kaplan et al.

Estimation of an unknown deterministic vector from quantized sensor data is considered in the presence of spoofing attacks which alter the data presented to several sensors. Contrary to previous work, a generalized attack model is employed which manipulates the data using transformations with arbitrary functional forms determined by some attack parameters whose values are unknown to the attacked system. For the first time, necessary and sufficient conditions are provided under which the transformations provide a guaranteed attack performance in terms of Cramer-Rao Bound (CRB) regardless of the processing the estimation system employs, thus defining a highly desirable attack. Interestingly, these conditions imply that, for any such attack when the attacked sensors can be perfectly identified by the estimation system, either the Fisher Information Matrix (FIM) for jointly estimating the desired and attack parameters is singular or that the attacked system is unable to improve the CRB for the desired vector parameter through this joint estimation even though the joint FIM is nonsingular. It is shown that it is always possible to construct such a highly desirable attack by properly employing a sufficiently large dimension attack vector parameter relative to the number of quantization levels employed, which was not observed previously. To illustrate the theory in a concrete way, we also provide some numerical results which corroborate that under the highly desirable attack, attacked data is not useful in reducing the CRB.

Jiangfan Zhang

2 Papers