Zhikang Zhang

CV
h-index 61
13 papers
181 citations
Novelty 52%
AI Score 50

13 Papers

AI · Mar 17, 2025
The Amazon Nova Family of Models: Technical Report and Model Card

Amazon AGI, Aaron Langford, Aayush Shah et al. · amazon-science

We present Amazon Nova, a new generation of state-of-the-art foundation models that deliver frontier intelligence and industry-leading price performance. Amazon Nova Pro is a highly capable multimodal model with the best combination of accuracy, speed, and cost for a wide range of tasks. Amazon Nova Lite is a low-cost multimodal model that is lightning fast for processing images, video, documents and text. Amazon Nova Micro is a text-only model that delivers our lowest-latency responses at very low cost. Amazon Nova Canvas is an image generation model that creates professional-grade images with rich customization controls. Amazon Nova Reel is a video generation model offering high-quality outputs, customization, and motion control. Our models were built responsibly and with a commitment to customer trust, security, and reliability. We report benchmarking results for core capabilities, agentic performance, long context, functional adaptation, runtime performance, and human evaluation.

CV · Nov 8, 2022
Enhanced Low-resolution LiDAR-Camera Calibration Via Depth Interpolation and Supervised Contrastive Learning

Zhikang Zhang, Zifan Yu, Suya You et al. · amazon-science

Motivated by the recent increase in applications of low-resolution LiDAR, we target the problem of low-resolution LiDAR-camera calibration in this work. The main challenges are two-fold: sparsity and noise in point clouds. To address the problem, we propose to apply depth interpolation to increase the point density and supervised contrastive learning to learn noise-resistant features. The experiments on RELLIS-3D demonstrate that our approach achieves average mean absolute translation/rotation errors of 0.15 cm/0.33° on 32-channel LiDAR point cloud data, which significantly outperforms all reference methods.

CV · Aug 1, 2024
Text-Guided Video Masked Autoencoder

David Fan, Jue Wang, Shuai Liao et al. · amazon-science

Recent video masked autoencoder (MAE) works have designed improved masking algorithms focused on saliency. These works leverage visual cues such as motion to mask the most salient regions. However, the robustness of such visual cues depends on how often input videos match underlying assumptions. On the other hand, natural language description is an information-dense representation of video that implicitly captures saliency without requiring modality-specific assumptions, and has not yet been explored for video MAE. To this end, we introduce a novel text-guided masking algorithm (TGM) that masks the video regions with highest correspondence to paired captions. Without leveraging any explicit visual cues for saliency, our TGM is competitive with state-of-the-art masking algorithms such as motion-guided masking. To further benefit from the semantics of natural language for masked reconstruction, we next introduce a unified framework for joint MAE and masked video-text contrastive learning. We show that across existing masking algorithms, unifying MAE and masked video-text contrastive learning improves downstream performance compared to pure MAE on a variety of video recognition tasks, especially for linear probe. Within this unified framework, our TGM achieves the best relative performance on five action recognition datasets and one egocentric dataset, highlighting the complementary nature of natural language for masked video modeling.
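As a rough illustration of masking the regions that correspond most strongly to a caption, the sketch below scores each patch by cosine similarity to a caption embedding and masks the top-scoring fraction. It is not the authors' implementation: the abstract does not say how correspondence is computed, so the shared embedding space, the function name `text_guided_mask`, and the `mask_ratio` parameter are all assumptions made for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def text_guided_mask(patch_embs, caption_emb, mask_ratio=0.75):
    """Return one boolean per patch; True means the patch is masked.

    Hypothetical helper: assumes patch and caption embeddings already live
    in a shared space (for example, from a pretrained vision-language model).
    """
    scores = [cosine(p, caption_emb) for p in patch_embs]
    n_mask = int(mask_ratio * len(patch_embs))
    # Patches most similar to the caption are masked first.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    masked = set(order[:n_mask])
    return [i in masked for i in range(len(patch_embs))]


# Toy example: four 2-D "patch" embeddings and one caption embedding.
patches = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]
caption = [1.0, 0.0]
print(text_guided_mask(patches, caption, mask_ratio=0.5))
```

In this toy setup, the two patches pointing in roughly the same direction as the caption are the ones selected for masking.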

CV · Feb 16, 2023
TransUPR: A Transformer-based Uncertain Point Refiner for LiDAR Point Cloud Semantic Segmentation

Zifan Yu, Meida Chen, Zhikang Zhang et al. · amazon-science

Common image-based LiDAR point cloud semantic segmentation (LiDAR PCSS) approaches have bottlenecks resulting from the boundary-blurring problem of convolutional neural networks (CNNs) and the quantization loss of spherical projection. In this work, we propose a transformer-based plug-and-play uncertain point refiner, i.e., TransUPR, to refine selected uncertain points in a learnable manner, which leads to improved segmentation performance. Uncertain points are sampled from coarse semantic segmentation results of 2D image segmentation where uncertain points are located close to the object boundaries in the 2D range image representation and 3D spherical projection background points. Following that, the geometry and coarse semantic features of uncertain points are aggregated by neighbor points in 3D space without adding expensive computation and memory footprint. Finally, the transformer-based refiner, which contains four stacked self-attention layers, along with an MLP module, is utilized for uncertain point classification on the concatenated features of self-attention layers. As the proposed refiner is independent of 2D CNNs, our TransUPR can be easily integrated into any existing image-based LiDAR PCSS approaches, e.g., CENet. Our TransUPR with CENet achieves state-of-the-art performance, i.e., 68.2% mean Intersection over Union (mIoU) on the Semantic KITTI benchmark, which provides a performance improvement of 0.6% on the mIoU compared to the original CENet.
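For readers unfamiliar with the mIoU metric reported above, here is a minimal sketch of how it is commonly computed from per-point labels. The function name `mean_iou` and the choice to skip classes absent from both prediction and ground truth are assumptions of this sketch. Benchmark toolkits may handle ignored labels and absent classes differently.

```python
def mean_iou(pred, target, num_classes):
    """Mean Intersection over Union, averaged over classes present in pred or target."""
    ious = []
    for c in range(num_classes):
        # Points where both agree on class c, and points where either says class c.
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:  # assumption: classes absent from both are skipped
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0


# Toy example with six points and three classes.
pred = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, 0]
print(mean_iou(pred, target, num_classes=3))
```

Here the per-class IoUs are 1/3, 2/3, and 1/2, so the mean is one half.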

CV · Nov 8, 2022
Automatic Error Detection in Integrated Circuits Image Segmentation: A Data-driven Approach

Zhikang Zhang, Bruno Machado Trindade, Michael Green et al. · amazon-science

Due to the complicated nanoscale structures of current integrated circuit (IC) builds and the low error tolerance of IC image segmentation tasks, most existing automated IC image segmentation approaches require human experts for visual inspection to ensure correctness, which is one of the major bottlenecks in large-scale industrial applications. In this paper, we present the first data-driven automatic error detection approach targeting two types of IC segmentation errors: wire errors and via errors. On an IC image dataset collected from industry, we demonstrate that, by adapting existing CNN-based approaches of image classification and image translation with additional pre-processing and post-processing techniques, we are able to achieve recall/precision of 0.92/0.93 in wire error detection and 0.96/0.90 in via error detection, respectively.

CV · Jan 29
Token Entropy Regularization for Multi-modal Antenna Affiliation Identification

Dong Chen, Ruoyu Li, Xinyan Zhang et al.

Accurate antenna affiliation identification is crucial for optimizing and maintaining communication networks. Current practice, however, relies on the cumbersome and error-prone process of manual tower inspections. We propose a novel paradigm that fuses video footage of base stations, antenna geometric features, and Physical Cell Identity (PCI) signals, transforming antenna affiliation identification into multi-modal classification and matching tasks. Publicly available pretrained transformers struggle with this unique task due to a lack of analogous data in the communications domain, which hampers cross-modal alignment. To address this, we introduce a dedicated training framework that aligns antenna images with corresponding PCI signals. To tackle the representation alignment challenge, we propose a novel Token Entropy Regularization (TER) module in the pretraining stage. Our experiments demonstrate that TER accelerates convergence and yields significant performance gains. Further analysis reveals that the entropy of the first token is modality-dependent. Code will be made available upon publication.

CV · Feb 28, 2021 · Code
OpenICS: Open Image Compressive Sensing Toolbox and Benchmark

Jonathan Zhao, Matthew Westerham, Mark Lakatos-Toth et al.

We present OpenICS, an image compressive sensing toolbox that includes multiple image compressive sensing and reconstruction algorithms proposed in the past decade. Due to the lack of standardization in the implementation and evaluation of the proposed algorithms, the application of image compressive sensing in the real world is limited. We believe this toolbox is the first framework that provides a unified and standardized implementation of multiple image compressive sensing algorithms. In addition, we conduct a benchmarking study on the methods included in this framework from two aspects: reconstruction accuracy and reconstruction efficiency. We hope this toolbox and benchmark can help the growing compressive sensing research community and industry apply image compressive sensing to new problems and develop new methods more efficiently. Code and models are available at https://github.com/PSCLab-ASU/OpenICS. The project is still under maintenance, and we will keep this document updated.

73.3 · CV · May 10
SpaceMind++: Toward Allocentric Cognitive Maps for Spatially Grounded Video MLLMs

Bo Gu, Zhikang Zhang, Zizhuang Wei et al.

Recent multimodal large language models (MLLMs) have made remarkable progress in visual understanding and language-based reasoning, yet they lack a persistent world-centered representation for spatially consistent reasoning in 3D environments. Inspired by the mammalian dual-stream system, where semantic and spatial cues are processed separately and integrated into an allocentric cognitive map, we propose SpaceMind++, a video MLLM architecture that explicitly builds a voxelized cognitive map from RGB videos. This map reorganizes fragmented egocentric observations into a shared 3D metric representation, enabling the model to preserve object permanence and spatial topology across changing viewpoints. To make this allocentric representation usable by a pretrained video MLLM without disrupting its native visual-token interface, we introduce Coordinate-Guided Deep Iterative Fusion, a new mechanism that relays map-level spatial knowledge back into the original 2D visual features. This fusion is explicitly guided by coordinate embeddings and 3D Rotary Positional Encoding, which ground semantic interactions in metric 3D space, resembling the entorhinal binding of sensory features to metric space. Extensive experiments show that SpaceMind++ achieves new state-of-the-art performance on VSI-Bench. Furthermore, it demonstrates superior out-of-distribution generalization on SPBench, SITE-Bench, and SPAR-Bench, underscoring its robustness in unseen 3D environments.

CV · Oct 31, 2024
Video Token Merging for Long-form Video Understanding

Seon-Ho Lee, Jue Wang, Zhikang Zhang et al. · amazon-science

As the scale of data and models for video understanding rapidly expands, handling long-form video input in transformer-based models presents a practical challenge. Rather than resorting to input sampling or token dropping, which may result in information loss, token merging shows promising results when used in collaboration with transformers. However, the application of token merging for long-form video processing is not trivial. We begin with the premise that token merging should not rely solely on the similarity of video tokens; the saliency of tokens should also be considered. To address this, we explore various video token merging strategies for long-form video classification, starting with a simple extension of image token merging, moving to region-concentrated merging, and finally proposing a learnable video token merging (VTM) algorithm that dynamically merges tokens based on their saliency. Extensive experimental results show that we achieve better or comparable performance on the LVU, COIN, and Breakfast datasets. Moreover, our approach significantly reduces memory costs by 84% and boosts throughput by approximately 6.89 times compared to baseline algorithms.
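To make the idea of token merging concrete, the sketch below shows a deliberately simple, similarity-only variant: it repeatedly averages the two most similar tokens until a target count remains. This is not the learnable, saliency-aware VTM algorithm described in the abstract. The greedy pairwise strategy and the names `merge_tokens` and `keep` are assumptions chosen for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def merge_tokens(tokens, keep):
    """Greedily merge the most similar pair of tokens until `keep` tokens remain.

    Merged tokens are size-weighted averages, so a token that already
    represents several originals counts proportionally more.
    """
    tokens = [list(t) for t in tokens]
    sizes = [1] * len(tokens)
    while len(tokens) > max(keep, 1):
        best = None
        for i in range(len(tokens)):
            for j in range(i + 1, len(tokens)):
                sim = cosine(tokens[i], tokens[j])
                if best is None or sim > best[0]:
                    best = (sim, i, j)
        _, i, j = best
        total = sizes[i] + sizes[j]
        tokens[i] = [
            (a * sizes[i] + b * sizes[j]) / total
            for a, b in zip(tokens[i], tokens[j])
        ]
        sizes[i] = total
        del tokens[j]
        del sizes[j]
    return tokens


# Toy example: two near-duplicate tokens and one distinct token.
tokens = [[1.0, 0.0], [1.0, 0.01], [0.0, 1.0]]
print(merge_tokens(tokens, keep=2))
```

The exhaustive pair search is quadratic per merge step, which is fine for a toy example but would not scale to long videos. It is used here only because it keeps the merging rule easy to read.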

CV · Dec 13, 2024
NowYouSee Me: Context-Aware Automatic Audio Description

Seon-Ho Lee, Jue Wang, David Fan et al. · amazon-science

Audio Description (AD) plays a pivotal role as an application system aimed at guaranteeing accessibility in multimedia content, which provides additional narrations at suitable intervals to describe visual elements, catering specifically to the needs of visually impaired audiences. In this paper, we introduce CA³D, the pioneering unified Context-Aware Automatic Audio Description system that provides AD event scripts with precise locations in long cinematic content. Specifically, the CA³D system consists of: 1) a Temporal Feature Enhancement Module to efficiently capture longer-term dependencies, 2) an anchor-based AD event detector with a feature suppression module that localizes the AD events and extracts discriminative features for AD generation, and 3) a self-refinement module that leverages the generated output to tweak AD event boundaries from coarse to fine. Unlike conventional methods, which rely on metadata and ground-truth AD timestamps for AD detection and generation tasks, the proposed CA³D is the first end-to-end trainable system that uses only visual cues. Extensive experiments demonstrate that the proposed CA³D improves on existing architectures for both AD event detection and script generation metrics, establishing new state-of-the-art performance in AD automation.

CV · Dec 10, 2024
GEXIA: Granularity Expansion and Iterative Approximation for Scalable Multi-grained Video-language Learning

Yicheng Wang, Zhikang Zhang, Jue Wang et al. · amazon-science

In various video-language learning tasks, the challenge of achieving cross-modality alignment with multi-grained data persists. We propose a method to tackle this challenge from two crucial perspectives: data and modeling. Given the absence of a multi-grained video-text pretraining dataset, we introduce a Granularity EXpansion (GEX) method with Integration and Compression operations to expand the granularity of a single-grained dataset. To better model multi-grained data, we introduce an Iterative Approximation Module (IAM), which embeds multi-grained videos and texts into a unified, low-dimensional semantic space while preserving essential information for cross-modal alignment. Furthermore, GEXIA is highly scalable with no restrictions on the number of video-text granularities for alignment. We evaluate our work on three categories of video tasks across seven benchmark datasets, showcasing state-of-the-art or comparable performance. Remarkably, our model excels in tasks involving long-form video understanding, even though the pretraining dataset only contains short video clips.

CV · Nov 28, 2025
SpaceMind: Camera-Guided Modality Fusion for Spatial Reasoning in Vision-Language Models

Ruosen Zhao, Zhikang Zhang, Jialei Xu et al.

Large vision-language models (VLMs) show strong multimodal understanding but still struggle with 3D spatial reasoning, such as distance estimation, size comparison, and cross-view consistency. Existing 3D-aware methods either depend on auxiliary 3D information or enhance RGB-only VLMs with geometry encoders through shallow feature fusion. We propose SpaceMind, a multimodal large language model explicitly designed for spatial reasoning solely from RGB inputs. The model adopts a dual-encoder architecture, integrating VGGT as a spatial understanding encoder and InternViT as a 2D visual encoder. The key idea is to treat the camera representation as an active guiding modality rather than passive metadata. Specifically, SpaceMind introduces a lightweight Camera-Guided Modality Fusion module before the language model to replace shallow fusion. It applies camera-conditioned biasing to spatial tokens, assigns query-independent weights reflecting their geometric importance, and uses the camera embedding to gate the fused representation. Empirically, SpaceMind establishes new state-of-the-art results on VSI-Bench, SQA3D and SPBench, surpassing both open and proprietary systems on VSI-Bench and SPBench by large margins and achieving state-of-the-art performance on SQA3D. These results demonstrate that camera-guided modality fusion is an effective and practical inductive bias for equipping VLMs with genuinely spatially grounded intelligence. We will release code and model checkpoints to support future research.

CV · Jul 24, 2018
LAPRAN: A Scalable Laplacian Pyramid Reconstructive Adversarial Network for Flexible Compressive Sensing Reconstruction

Kai Xu, Zhikang Zhang, Fengbo Ren

This paper addresses the single-image compressive sensing (CS) and reconstruction problem. We propose a scalable Laplacian pyramid reconstructive adversarial network (LAPRAN) that enables high-fidelity, flexible and fast CS image reconstruction. LAPRAN progressively reconstructs an image following the concept of the Laplacian pyramid through multiple stages of reconstructive adversarial networks (RANs). At each pyramid level, CS measurements are fused with a contextual latent vector to generate a high-frequency image residual. Consequently, LAPRAN can produce a hierarchy of reconstructed images, each with incrementally higher resolution and improved quality. The scalable pyramid structure of LAPRAN enables high-fidelity CS reconstruction with a flexible resolution that is adaptive to a wide range of compression ratios (CRs), which is infeasible with existing methods. Experimental results on multiple public datasets show that LAPRAN offers average PSNR improvements of 7.47 dB and 5.98 dB, and average SSIM improvements of 57.93% and 33.20%, compared to model-based and data-driven baselines, respectively.