83.6ROMay 18
Key-Gram: Extensible World Knowledge for Embodied ManipulationJingjing Fan, Siyuan Li, Botao Ren et al.
Embodied control increasingly requires models to follow compositional language instructions while reasoning over dynamic visual states. However, current vision-language-action policies and world-action models often couple linguistic knowledge with visual computation in a shared backbone or conditioning pathway, leading to modality competition and making knowledge extension dependent on backbone updates. In this paper, we introduce Key-Gram, a conditional-memory framework that separates language-derived world knowledge from visual-state reasoning for embodied control. At its core is a memory module that decomposes an instruction into task-specific key-grams, retrieves static linguistic priors through deterministic hashed lookup, and injects the retrieved entries into selected hidden layers through context-aware gating and lightweight convolutional fusion. This design allows the backbone to devote its main capacity to visual reasoning and action inference, while reusable instruction knowledge is stored in an extensible external memory. The logical memory table can be conveniently partitioned during training and, due to its $O(1)$ lookup pattern, efficiently placed on host memory during inference. Across RoboTwin2.0, LIBERO/LIBERO-Plus, and real-world dual-arm manipulation, Key-Gram consistently improves both $π_{0}$ and $π_{0.5}$ backbones, with average relative gains of $29.5\%/9.9\%$ on RoboTwin2.0, $35.8\%/4.5\%$ on LIBERO-Plus transfer without target-domain fine-tuning, and $15.4\%/8.1\%$ on real-world long-horizon tasks. These results demonstrate that externalized linguistic memory provides an effective and extensible mechanism for improving compositional grounding, transfer, and real-world manipulation.
ROFeb 5
FUTURE-VLA: Forecasting Unified Trajectories Under Real-time ExecutionJingjing Fan, Yushan Liu, Shoujie Li et al.
General vision-language models increasingly support unified spatiotemporal reasoning over long video streams, yet deploying such capabilities on robots remains constrained by the prohibitive latency of processing long-horizon histories and generating high-dimensional future predictions. To bridge this gap, we present FUTURE-VLA, a unified architecture that reformulates long-horizon control and future forecasting as a monolithic sequence-generation task. Adopting a dual-sided efficiency paradigm, FUTURE-VLA leverages a temporally adaptive compression strategy to maximize spatiotemporal information density, enabling the ingestion of extensive multi-view histories while maintaining constant inference latency. Simultaneously, it performs latent-space autoregression to align actionable dynamics with reviewable visual look-aheads in a single forward pass. These real-time predictive capabilities further enable a prediction-guided Human-In-the-Loop mechanism via interactive execution gating, allowing operators to dynamically validate behaviors based on interpretable future previews. Extensive evaluations demonstrate that FUTURE-VLA establishes new state-of-the-art performance, attaining success rates of 99.2% on LIBERO, 75.4% on RoboTwin, and 78.0% on a real-world Piper platform, all with a $16\times$ extended spatiotemporal window while maintaining the inference latency of a single-frame baseline.
LGFeb 17, 2024
Neural Networks with (Low-Precision) Polynomial Approximations: New Insights and Techniques for Accuracy ImprovementChi Zhang, Jingjing Fan, Man Ho Au et al.
Replacing non-polynomial functions (e.g., non-linear activation functions such as ReLU) in a neural network with their polynomial approximations is a standard practice in privacy-preserving machine learning. The resulting neural network, called polynomial approximation of neural network (PANN) in this paper, is compatible with advanced cryptosystems to enable privacy-preserving model inference. Using ``highly precise'' approximation, state-of-the-art PANN offers similar inference accuracy as the underlying backbone model. However, little is known about the effect of approximation, and existing literature often determined the required approximation precision empirically. In this paper, we initiate the investigation of PANN as a standalone object. Specifically, our contribution is two-fold. Firstly, we provide an explanation on the effect of approximate error in PANN. In particular, we discovered that (1) PANN is susceptible to some type of perturbations; and (2) weight regularisation significantly reduces PANN's accuracy. We support our explanation with experiments. Secondly, based on the insights from our investigations, we propose solutions to increase inference accuracy for PANN. Experiments showed that combination of our solutions is very effective: at the same precision, our PANN is 10% to 50% more accurate than state-of-the-arts; and at the same accuracy, our PANN only requires a precision of 2^{-9} while state-of-the-art solution requires a precision of 2^{-12} using the ResNet-20 model on CIFAR-10 dataset.