LG · AI · CV · Apr 27

Nemotron 3 Nano Omni: Efficient and Open Multimodal Intelligence

Amala Sanjay Deshmukh, Kateryna Chumachenko, Tuomas Rintamaki, Matthieu Le, Tyler Poon, Danial Mohseni Taheri, Ilia Karmanov, Guilin Liu, Jarno Seppanen, Arushi Goel, Mike Ranzinger, Greg Heinrich

Amazon, NVIDIA

arXiv:2604.24954 · 76.61 citations · h-index: 43

Predicted impact top 1% in LG · last 90 days · Originality: Incremental advance

AI Analysis

This work provides an efficient, open multimodal model for researchers and developers, but the improvements are incremental over the existing Nemotron Nano V2 VL.

Nemotron 3 Nano Omni introduces native audio input support alongside text, images, and video, achieving consistent accuracy improvements over its predecessor across all modalities, with leading results in document understanding, long audio-video comprehension, and agentic computer use, while also delivering lower inference latency and higher throughput.

We introduce Nemotron 3 Nano Omni, the latest model in the Nemotron multimodal series and the first to natively support audio inputs alongside text, images, and video. Nemotron 3 Nano Omni delivers consistent accuracy improvements over its predecessor, Nemotron Nano V2 VL, across all modalities, enabled by advances in architecture, training data, and recipes. In particular, Nemotron 3 Nano Omni delivers leading results in real-world document understanding, long audio-video comprehension, and agentic computer use. Built on the highly efficient Nemotron 3 Nano 30B-A3B backbone, Nemotron 3 Nano Omni further incorporates innovative multimodal token-reduction techniques to deliver substantially lower inference latency and higher throughput than other models of similar size. We are releasing model checkpoints in BF16, FP8, and FP4 formats, along with portions of the training data and codebase to facilitate further research and development.
