Xin Li

h-index: 53

Papers: 3

Citations: 132

Novelty: 72%

AI Score: 44

Ranked #50,404 of 194,257 authors (top 26%); #27 in CC (top 29%)

3 Papers

22.3 | CV | Apr 9, 2025 | Code

OmniCaptioner: One Captioner to Rule Them All

Yiting Lu, Jiakang Yuan, Zhen Li et al.

We propose OmniCaptioner, a versatile visual captioning framework for generating fine-grained textual descriptions across a wide variety of visual domains. Unlike prior methods limited to specific image types (e.g., natural images or geometric visuals), our framework provides a unified solution for captioning natural images, visual text (e.g., posters, UIs, textbooks), and structured visuals (e.g., documents, tables, charts). By converting low-level pixel information into semantically rich textual representations, our framework bridges the gap between visual and textual modalities. Our results highlight three key advantages: (i) Enhanced Visual Reasoning with LLMs, where long-context captions of visual modalities empower LLMs, particularly the DeepSeek-R1 series, to reason effectively in multimodal scenarios; (ii) Improved Image Generation, where detailed captions improve tasks like text-to-image generation and image transformation; and (iii) Efficient Supervised Fine-Tuning (SFT), which enables faster convergence with less data. We believe the versatility and adaptability of OmniCaptioner can offer a new perspective for bridging the gap between language and visual modalities.

7.1 | LG | Jul 3, 2025

Cycle-Consistent Helmholtz Machine: Goal-Seeded Simulation via Inverted Inference

Xin Li

The Helmholtz Machine (HM) is a foundational architecture for unsupervised learning, coupling a bottom-up recognition model with a top-down generative model through alternating inference. However, its reliance on symmetric, data-driven updates constrains its ability to perform goal-directed reasoning or simulate temporally extended processes. In this work, we introduce the \emph{Cycle-Consistent Helmholtz Machine} (C$^2$HM), a novel extension that reframes inference as a \emph{goal-seeded}, \emph{asymmetric} process grounded in structured internal priors. Rather than inferring latent causes solely from sensory data, C$^2$HM simulates plausible latent trajectories conditioned on abstract goals, aligning them with observed outcomes through a recursive cycle of forward generation and inverse refinement. This cycle-consistent formulation integrates top-down structure with bottom-up evidence via a variational loop, enforcing mutual alignment between goal-conditioned latent predictions and recognition-based reconstructions. We formalize this mechanism within the framework of the \emph{Context-Content Uncertainty Principle} (CCUP), which posits that inference proceeds by aligning structured, low-entropy content with high-entropy, ambiguous context. C$^2$HM improves representational efficiency, supports memory chaining via path-dependent inference, and enables spatial compositional imagination. By offering a biologically inspired alternative to classical amortized inference, C$^2$HM reconceives generative modeling as intentional simulation, bridging memory-based planning and unsupervised learning in a unified probabilistic framework.

5.1 | CC | Jul 30, 2016

Improved Non-Malleable Extractors, Non-Malleable Codes and Independent Source Extractors

Xin Li

In this paper we give improved constructions of several central objects in the literature of randomness extraction and tamper-resilient cryptography. Our main results are: (1) An explicit seeded non-malleable extractor with error $ε$ and seed length $d=O(\log n)+O(\log(1/ε)\log \log (1/ε))$, that supports min-entropy $k=Ω(d)$ and outputs $Ω(k)$ bits. Combined with the protocol in \cite{DW09}, this gives a two-round privacy amplification protocol with optimal entropy loss in the presence of an active adversary, for all security parameters up to $Ω(k/\log k)$. (2) An explicit non-malleable two-source extractor for min-entropy $k \geq (1-γ)n$, for some constant $γ>0$, that outputs $Ω(k)$ bits with error $2^{-Ω(n/\log n)}$. Combined with the connection in \cite{CG14b}, this gives a non-malleable code in the two-split-state model with relative rate $Ω(1/\log n)$. This exponentially improves previous constructions, all of which only achieve rate $n^{-Ω(1)}$.\footnote{The work of Aggarwal et al. \cite{ADKO15} had a construction which "achieves" constant rate, but recently the author found an error in their proof.} (3) A two-source extractor for min-entropy $O(\log n \log \log n)$, which also implies a $K$-Ramsey graph on $N$ vertices with $K=(\log N)^{O(\log \log \log N)}$. We also obtain a seeded non-malleable $9$-source extractor with optimal seed length, which in turn gives a $10$-source extractor for min-entropy $O(\log n)$. Previously, the best known extractor for such min-entropy required $O(\log \log n)$ sources \cite{CohL16}. Independently of our work, Cohen \cite{Cohen16} obtained similar results to (1) and the two-source extractor, except that the dependence on $ε$ is $\log(1/ε)(\log \log (1/ε))^{O(1)}$ and the two-source extractor requires min-entropy $\log n (\log \log n)^{O(1)}$.