RONov 2, 2025
Maestro: Orchestrating Robotics Modules with Vision-Language Models for Zero-Shot Generalist RobotsJunyao Shi, Rujia Yang, Kaitian Chao et al.
Today's best-explored routes towards generalist robots center on collecting ever larger "observations-in actions-out" robotics datasets to train large end-to-end models, copying a recipe that has worked for vision-language models (VLMs). We pursue a road less traveled: building generalist policies directly around VLMs by augmenting their general capabilities with specific robot capabilities encapsulated in a carefully curated set of perception, planning, and control modules. In Maestro, a VLM coding agent dynamically composes these modules into a programmatic policy for the current task and scenario. Maestro's architecture benefits from a streamlined closed-loop interface without many manually imposed structural constraints, and a comprehensive and diverse tool repertoire. As a result, it largely surpasses today's VLA models for zero-shot performance on challenging manipulation skills. Further, Maestro is easily extensible to incorporate new modules, easily editable to suit new embodiments such as a quadruped-mounted arm, and even easily adapts from minimal real-world experiences through local code edits.
CLMay 23, 2025Code
Activation Control for Efficiently Eliciting Long Chain-of-thought Ability of Language ModelsZekai Zhao, Qi Liu, Kun Zhou et al.
Despite the remarkable reasoning performance, eliciting the long chain-of-thought (CoT) ability in large language models (LLMs) typically requires costly reinforcement learning or supervised fine-tuning on high-quality distilled data. We investigate the internal mechanisms behind this capability and show that a small set of high-impact activations in the last few layers largely governs long-form reasoning attributes, such as output length and self-reflection. By simply amplifying these activations and inserting "wait" tokens, we can invoke the long CoT ability without any training, resulting in significantly increased self-reflection rates and accuracy. Moreover, we find that the activation dynamics follow predictable trajectories, with a sharp rise after special tokens and a subsequent exponential decay. Building on these insights, we introduce a general training-free activation control technique. It leverages a few contrastive examples to identify key activations, and employs simple analytic functions to modulate their values at inference time to elicit long CoTs. Extensive experiments confirm the effectiveness of our method in efficiently eliciting long CoT reasoning in LLMs and improving their performance. Additionally, we propose a parameter-efficient fine-tuning method that trains only a last-layer activation amplification module and a few LoRA layers, outperforming full LoRA fine-tuning on reasoning benchmarks with significantly fewer parameters. Our code and data are publicly released.
CVJul 12, 2025Code
Automated Multi-Class Crop Pathology Classification via Convolutional Neural Networks: A Deep Learning Approach for Real-Time Precision AgricultureSourish Suri, Yifei Shao
Crop diseases present a significant barrier to agricultural productivity and global food security, especially in large-scale farming where early identification is often delayed or inaccurate. This research introduces a Convolutional Neural Network (CNN)-based image classification system designed to automate the detection and classification of eight common crop diseases using leaf imagery. The methodology involves a complete deep learning pipeline: image acquisition from a large, labeled dataset, preprocessing via resizing, normalization, and augmentation, and model training using TensorFlow with Keras' Sequential API. The CNN architecture comprises three convolutional layers with increasing filter sizes and ReLU activations, followed by max pooling, flattening, and fully connected layers, concluding with a softmax output for multi-class classification. The system achieves high training accuracy (~90%) and demonstrates reliable performance on unseen data, although a validation accuracy of ~60% suggests minor overfitting. Notably, the model integrates a treatment recommendation module, providing actionable guidance by mapping each detected disease to suitable pesticide or fungicide interventions. Furthermore, the solution is deployed on an open-source, mobile-compatible platform, enabling real-time image-based diagnostics for farmers in remote areas. This research contributes a scalable and accessible tool to the field of precision agriculture, reducing reliance on manual inspection and promoting sustainable disease management practices. By merging deep learning with practical agronomic support, this work underscores the potential of CNNs to transform crop health monitoring and enhance food production resilience on a global scale.
CLMay 19, 2025Code
Decentralized Arena: Towards Democratic and Scalable Automatic Evaluation of Language ModelsYanbin Yin, Kun Zhou, Zhen Wang et al.
The recent explosion of large language models (LLMs), each with its own general or specialized strengths, makes scalable, reliable benchmarking more urgent than ever. Standard practices nowadays face fundamental trade-offs: closed-ended question-based benchmarks (eg MMLU) struggle with saturation as newer models emerge, while crowd-sourced leaderboards (eg Chatbot Arena) rely on costly and slow human judges. Recently, automated methods (eg LLM-as-a-judge) shed light on the scalability, but risk bias by relying on one or a few "authority" models. To tackle these issues, we propose Decentralized Arena (dearena), a fully automated framework leveraging collective intelligence from all LLMs to evaluate each other. It mitigates single-model judge bias by democratic, pairwise evaluation, and remains efficient at scale through two key components: (1) a coarse-to-fine ranking algorithm for fast incremental insertion of new models with sub-quadratic complexity, and (2) an automatic question selection strategy for the construction of new evaluation dimensions. Across extensive experiments across 66 LLMs, dearena attains up to 97% correlation with human judgements, while significantly reducing the cost. Our code and data will be publicly released on https://github.com/maitrix-org/de-arena.
LGMay 23, 2025
Towards General Continuous Memory for Vision-Language ModelsWenyi Wu, Zixuan Song, Kun Zhou et al.
Language models (LMs) and their extension, vision-language models (VLMs), have achieved remarkable performance across various tasks. However, they still struggle with complex reasoning tasks that require multimodal or multilingual real-world knowledge. To support such capabilities, an external memory system that can efficiently provide relevant multimodal information is essential. Existing approaches generally concatenate image and text tokens into a long sequence as memory, which, however, may drastically increase context length and even degrade performance. In contrast, we propose using continuous memory, a compact set of dense embeddings to more effectively and efficiently represent multimodal and multilingual knowledge. Our key insight is that a VLM can serve as its own continuous memory encoder. We empirically show that this design improves performance on complex multimodal reasoning tasks. Building on this, we introduce a data-efficient and parameter-efficient method to fine-tune the VLM into a memory encoder, requiring only 1.2% of the model's parameters and a small corpus of 15.6K self-synthesized samples. Our approach CoMEM utilizes VLM's original capabilities to encode arbitrary multimodal and multilingual knowledge into just 8 continuous embeddings. Since the inference-time VLM remains frozen, our memory module is plug-and-play and can be flexibly integrated as needed. Extensive experiments across eight multimodal reasoning benchmarks demonstrate the effectiveness of our approach.