CLSep 27, 2024
Evaluation of OpenAI o1: Opportunities and Challenges of AGITianyang Zhong, Zhengliang Liu, Yi Pan et al.
This comprehensive study evaluates the performance of OpenAI's o1-preview large language model across a diverse array of complex reasoning tasks, spanning multiple domains, including computer science, mathematics, natural sciences, medicine, linguistics, and social sciences. Through rigorous testing, o1-preview demonstrated remarkable capabilities, often achieving human-level or superior performance in areas ranging from coding challenges to scientific reasoning and from language processing to creative problem-solving. Key findings include: -83.3% success rate in solving complex competitive programming problems, surpassing many human experts. -Superior ability in generating coherent and accurate radiology reports, outperforming other evaluated models. -100% accuracy in high school-level mathematical reasoning tasks, providing detailed step-by-step solutions. -Advanced natural language inference capabilities across general and specialized domains like medicine. -Impressive performance in chip design tasks, outperforming specialized models in areas such as EDA script generation and bug analysis. -Remarkable proficiency in anthropology and geology, demonstrating deep understanding and reasoning in these specialized fields. -Strong capabilities in quantitative investing. O1 has comprehensive financial knowledge and statistical modeling skills. -Effective performance in social media analysis, including sentiment analysis and emotion recognition. The model excelled particularly in tasks requiring intricate reasoning and knowledge integration across various fields. While some limitations were observed, including occasional errors on simpler problems and challenges with certain highly specialized concepts, the overall results indicate significant progress towards artificial general intelligence.
AISep 14, 2023
Towards Artificial General Intelligence (AGI) in the Internet of Things (IoT): Opportunities and ChallengesFei Dou, Jin Ye, Geng Yuan et al.
Artificial General Intelligence (AGI), possessing the capacity to comprehend, learn, and execute tasks with human cognitive abilities, engenders significant anticipation and intrigue across scientific, commercial, and societal arenas. This fascination extends particularly to the Internet of Things (IoT), a landscape characterized by the interconnection of countless devices, sensors, and systems, collectively gathering and sharing data to enable intelligent decision-making and automation. This research embarks on an exploration of the opportunities and challenges towards achieving AGI in the context of the IoT. Specifically, it starts by outlining the fundamental principles of IoT and the critical role of Artificial Intelligence (AI) in IoT systems. Subsequently, it delves into AGI fundamentals, culminating in the formulation of a conceptual framework for AGI's seamless integration within IoT. The application spectrum for AGI-infused IoT is broad, encompassing domains ranging from smart grids, residential environments, manufacturing, and transportation to environmental monitoring, agriculture, healthcare, and education. However, adapting AGI to resource-constrained IoT settings necessitates dedicated research efforts. Furthermore, the paper addresses constraints imposed by limited computing resources, intricacies associated with large-scale IoT communication, as well as the critical concerns pertaining to security and privacy.
AIApr 24, 2023
AGI: Artificial General Intelligence for EducationEhsan Latif, Gengchen Mai, Matthew Nyaaba et al.
Artificial general intelligence (AGI) has gained global recognition as a future technology due to the emergence of breakthrough large language models and chatbots such as GPT-4 and ChatGPT, respectively. Compared to conventional AI models, typically designed for a limited range of tasks, demand significant amounts of domain-specific data for training and may not always consider intricate interpersonal dynamics in education. AGI, driven by the recent large pre-trained models, represents a significant leap in the capability of machines to perform tasks that require human-level intelligence, such as reasoning, problem-solving, decision-making, and even understanding human emotions and social interactions. This position paper reviews AGI's key concepts, capabilities, scope, and potential within future education, including achieving future educational goals, designing pedagogy and curriculum, and performing assessments. It highlights that AGI can significantly improve intelligent tutoring systems, educational assessment, and evaluation procedures. AGI systems can adapt to individual student needs, offering tailored learning experiences. They can also provide comprehensive feedback on student performance and dynamically adjust teaching methods based on student progress. The paper emphasizes that AGI's capabilities extend to understanding human emotions and social interactions, which are critical in educational settings. The paper discusses that ethical issues in education with AGI include data bias, fairness, and privacy and emphasizes the need for codes of conduct to ensure responsible AGI use in academic settings like homework, teaching, and recruitment. We also conclude that the development of AGI necessitates interdisciplinary collaborations between educators and AI engineers to advance research and application efforts.
AIApr 12, 2023
AGI for AgricultureGuoyu Lu, Sheng Li, Gengchen Mai et al.
Artificial General Intelligence (AGI) is poised to revolutionize a variety of sectors, including healthcare, finance, transportation, and education. Within healthcare, AGI is being utilized to analyze clinical medical notes, recognize patterns in patient data, and aid in patient management. Agriculture is another critical sector that impacts the lives of individuals worldwide. It serves as a foundation for providing food, fiber, and fuel, yet faces several challenges, such as climate change, soil degradation, water scarcity, and food security. AGI has the potential to tackle these issues by enhancing crop yields, reducing waste, and promoting sustainable farming practices. It can also help farmers make informed decisions by leveraging real-time data, leading to more efficient and effective farm management. This paper delves into the potential future applications of AGI in agriculture, such as agriculture image processing, natural language processing (NLP), robotics, knowledge graphs, and infrastructure, and their impact on precision livestock and precision crops. By leveraging the power of AGI, these emerging technologies can provide farmers with actionable insights, allowing for optimized decision-making and increased productivity. The transformative potential of AGI in agriculture is vast, and this paper aims to highlight its potential to revolutionize the industry.
CVAug 2, 2024
Embodiment: Self-Supervised Depth Estimation Based on Camera ModelsJinchang Zhang, Praveen Kumar Reddy, Xue-Iuan Wong et al.
Depth estimation is a critical topic for robotics and vision-related tasks. In monocular depth estimation, in comparison with supervised learning that requires expensive ground truth labeling, self-supervised methods possess great potential due to no labeling cost. However, self-supervised learning still has a large gap with supervised learning in 3D reconstruction and depth estimation performance. Meanwhile, scaling is also a major issue for monocular unsupervised depth estimation, which commonly still needs ground truth scale from GPS, LiDAR, or existing maps to correct. In the era of deep learning, existing methods primarily rely on exploring image relationships to train unsupervised neural networks, while the physical properties of the camera itself such as intrinsics and extrinsics are often overlooked. These physical properties are not just mathematical parameters; they are embodiments of the camera's interaction with the physical world. By embedding these physical properties into the deep learning model, we can calculate depth priors for ground regions and regions connected to the ground based on physical principles, providing free supervision signals without the need for additional sensors. This approach is not only easy to implement but also enhances the effects of all unsupervised methods by embedding the camera's physical properties into the model, thereby achieving an embodied understanding of the real world.
CVMay 10, 2024Code
Dual-Task Vision Transformer for Rapid and Accurate Intracerebral Hemorrhage CT Image ClassificationJialiang Fan, Xinhui Fan, Chengyan Song et al.
Intracerebral hemorrhage (ICH) is a severe and sudden medical condition caused by the rupture of blood vessels in the brain, leading to permanent damage to brain tissue and often resulting in functional disabilities or death in patients. Diagnosis and analysis of ICH typically rely on brain CT imaging. Given the urgency of ICH conditions, early treatment is crucial, necessitating rapid analysis of CT images to formulate tailored treatment plans. However, the complexity of ICH CT images and the frequent scarcity of specialist radiologists pose significant challenges. Therefore, we collect a dataset from the real world for ICH and normal classification and three types of ICH image classification based on the hemorrhage location, i.e., Deep, Subcortical, and Lobar. In addition, we propose a neural network structure, dual-task vision transformer (DTViT), for the automated classification and diagnosis of ICH images. The DTViT deploys the encoder from the Vision Transformer (ViT), employing attention mechanisms for feature extraction from CT images. The proposed DTViT framework also incorporates two multilayer perception (MLP)-based decoders to simultaneously identify the presence of ICH and classify the three types of hemorrhage locations. Experimental results demonstrate that DTViT performs well on the real-world test dataset. The code and newly collected dataset for this work are available at: https://github.com/jfan1997/DTViT.
CVSep 24, 2024
Underground Mapping and Localization Based on Ground-Penetrating RadarJinchang Zhang, Guoyu Lu
3D object reconstruction based on deep neural networks has gained increasing attention in recent years. However, 3D reconstruction of underground objects to generate point cloud maps remains a challenge. Ground Penetrating Radar (GPR) is one of the most powerful and extensively used tools for detecting and locating underground objects such as plant root systems and pipelines, with its cost-effectiveness and continuously evolving technology. This paper introduces a parabolic signal detection network based on deep convolutional neural networks, utilizing B-scan images from GPR sensors. The detected keypoints can aid in accurately fitting parabolic curves used to interpret the original GPR B-scan images as cross-sections of the object model. Additionally, a multi-task point cloud network was designed to perform both point cloud segmentation and completion simultaneously, filling in sparse point cloud maps. For unknown locations, GPR A-scan data can be used to match corresponding A-scan data in the constructed map, pinpointing the position to verify the accuracy of the map construction by the model. Experimental results demonstrate the effectiveness of our method.
CVSep 18, 2024
Depth Estimation Based on 3D Gaussian Splatting Siamese DefocusJinchang Zhang, Ningning Xu, Hao Zhang et al.
Depth estimation is a fundamental task in 3D geometry. While stereo depth estimation can be achieved through triangulation methods, it is not as straightforward for monocular methods, which require the integration of global and local information. The Depth from Defocus (DFD) method utilizes camera lens models and parameters to recover depth information from blurred images and has been proven to perform well. However, these methods rely on All-In-Focus (AIF) images for depth estimation, which is nearly impossible to obtain in real-world applications. To address this issue, we propose a self-supervised framework based on 3D Gaussian splatting and Siamese networks. By learning the blur levels at different focal distances of the same scene in the focal stack, the framework predicts the defocus map and Circle of Confusion (CoC) from a single defocused image, using the defocus map as input to DepthNet for monocular depth estimation. The 3D Gaussian splatting model renders defocused images using the predicted CoC, and the differences between these and the real defocused images provide additional supervision signals for the Siamese Defocus self-supervised network. This framework has been validated on both artificially synthesized and real blurred datasets. Subsequent quantitative and visualization experiments demonstrate that our proposed framework is highly effective as a DFD method.
ROFeb 9, 2024
LLMs for Coding and Robotics EducationPeng Shu, Huaqin Zhao, Hanqi Jiang et al.
Large language models and multimodal large language models have revolutionized artificial intelligence recently. An increasing number of regions are now embracing these advanced technologies. Within this context, robot coding education is garnering increasing attention. To teach young children how to code and compete in robot challenges, large language models are being utilized for robot code explanation, generation, and modification. In this paper, we highlight an important trend in robot coding education. We test several mainstream large language models on both traditional coding tasks and the more challenging task of robot code generation, which includes block diagrams. Our results show that GPT-4V outperforms other models in all of our tests but struggles with generating block diagram images.
CVMar 18, 2025
Vision-Language Embodiment for Monocular Depth EstimationJinchang Zhang, Guoyu Lu
Depth estimation is a core problem in robotic perception and vision tasks, but 3D reconstruction from a single image presents inherent uncertainties. Current depth estimation models primarily rely on inter-image relationships for supervised training, often overlooking the intrinsic information provided by the camera itself. We propose a method that embodies the camera model and its physical characteristics into a deep learning model, computing embodied scene depth through real-time interactions with road environments. The model can calculate embodied scene depth in real-time based on immediate environmental changes using only the intrinsic properties of the camera, without any additional equipment. By combining embodied scene depth with RGB image features, the model gains a comprehensive perspective on both geometric and visual details. Additionally, we incorporate text descriptions containing environmental content and depth information as priors for scene understanding, enriching the model's perception of objects. This integration of image and language - two inherently ambiguous modalities - leverages their complementary strengths for monocular depth estimation. The real-time nature of the embodied language and depth prior model ensures that the model can continuously adjust its perception and behavior in dynamic environments. Experimental results show that the embodied depth estimation method enhances model performance across different scenes.
CVMar 11, 2025
Keypoint Detection and Description for Raw Bayer ImagesJiakai Lin, Jinchang Zhang, Guoyu Lu
Keypoint detection and local feature description are fundamental tasks in robotic perception, critical for applications such as SLAM, robot localization, feature matching, pose estimation, and 3D mapping. While existing methods predominantly operate on RGB images, we propose a novel network that directly processes raw images, bypassing the need for the Image Signal Processor (ISP). This approach significantly reduces hardware requirements and memory consumption, which is crucial for robotic vision systems. Our method introduces two custom-designed convolutional kernels capable of performing convolutions directly on raw images, preserving inter-channel information without converting to RGB. Experimental results show that our network outperforms existing algorithms on raw images, achieving higher accuracy and stability under large rotations and scale variations. This work represents the first attempt to develop a keypoint detection and feature description network specifically for raw images, offering a more efficient solution for resource-constrained environments.
CVMar 11, 2025
Language-Depth Navigated Thermal and Visible Image FusionJinchang Zhang, Zijun Li, Guoyu Lu
Depth-guided multimodal fusion combines depth information from visible and infrared images, significantly enhancing the performance of 3D reconstruction and robotics applications. Existing thermal-visible image fusion mainly focuses on detection tasks, ignoring other critical information such as depth. By addressing the limitations of single modalities in low-light and complex environments, the depth information from fused images not only generates more accurate point cloud data, improving the completeness and precision of 3D reconstruction, but also provides comprehensive scene understanding for robot navigation, localization, and environmental perception. This supports precise recognition and efficient operations in applications such as autonomous driving and rescue missions. We introduce a text-guided and depth-driven infrared and visible image fusion network. The model consists of an image fusion branch for extracting multi-channel complementary information through a diffusion model, equipped with a text-guided module, and two auxiliary depth estimation branches. The fusion branch uses CLIP to extract semantic information and parameters from depth-enriched image descriptions to guide the diffusion model in extracting multi-channel features and generating fused images. These fused images are then input into the depth estimation branches to calculate depth-driven loss, optimizing the image fusion network. This framework aims to integrate vision-language and depth to directly generate color-fused images from multimodal inputs.
61.7CVMar 16
Fractal Autoregressive Depth Estimation with Continuous Token DiffusionJinchang Zhang, Xinrou Kang, Guoyu Lu
Monocular depth estimation can benefit from autoregressive (AR) generation, but direct AR modeling is hindered by the modality gap between RGB and depth, inefficient pixel-wise generation, and instability in continuous depth prediction. We propose a Fractal Visual Autoregressive Diffusion framework that reformulates depth estimation as a coarse-to-fine, next-scale autoregressive generation process. A VCFR module fuses multi-scale image features with current depth predictions to improve cross-modal conditioning, while a conditional denoising diffusion loss models depth distributions directly in continuous space and mitigates errors caused by discrete quantization. To improve computational efficiency, we organize the scale-wise generators into a fractal recursive architecture, reusing a base visual AR unit in a self-similar hierarchy. We further introduce an uncertainty-aware robust consensus aggregation scheme for multi-sample inference to improve fusion stability and provide a practical pixel-wise reliability estimate. Experiments on standard benchmarks demonstrate strong performance and validate the effectiveness of the proposed design.
CVOct 2, 2025
Automated Genomic Interpretation via Concept Bottleneck Models for Medical RoboticsZijun Li, Jinchang Zhang, Ming Zhang et al.
We propose an automated genomic interpretation module that transforms raw DNA sequences into actionable, interpretable decisions suitable for integration into medical automation and robotic systems. Our framework combines Chaos Game Representation (CGR) with a Concept Bottleneck Model (CBM), enforcing predictions to flow through biologically meaningful concepts such as GC content, CpG density, and k mer motifs. To enhance reliability, we incorporate concept fidelity supervision, prior consistency alignment, KL distribution matching, and uncertainty calibration. Beyond accurate classification of HIV subtypes across both in-house and LANL datasets, our module delivers interpretable evidence that can be directly validated against biological priors. A cost aware recommendation layer further translates predictive outputs into decision policies that balance accuracy, calibration, and clinical utility, reducing unnecessary retests and improving efficiency. Extensive experiments demonstrate that the proposed system achieves state of the art classification performance, superior concept prediction fidelity, and more favorable cost benefit trade-offs compared to existing baselines. By bridging the gap between interpretable genomic modeling and automated decision-making, this work establishes a reliable foundation for robotic and clinical automation in genomic medicine.
CVOct 1, 2025
Graph Integrated Multimodal Concept Bottleneck ModelJiakai Lin, Jinchang Zhang, Guoyu Lu
With growing demand for interpretability in deep learning, especially in high stakes domains, Concept Bottleneck Models (CBMs) address this by inserting human understandable concepts into the prediction pipeline, but they are generally single modal and ignore structured concept relationships. To overcome these limitations, we present MoE-SGT, a reasoning driven framework that augments CBMs with a structure injecting Graph Transformer and a Mixture of Experts (MoE) module. We construct answer-concept and answer-question graphs for multimodal inputs to explicitly model the structured relationships among concepts. Subsequently, we integrate Graph Transformer to capture multi level dependencies, addressing the limitations of traditional Concept Bottleneck Models in modeling concept interactions. However, it still encounters bottlenecks in adapting to complex concept patterns. Therefore, we replace the feed forward layers with a Mixture of Experts (MoE) module, enabling the model to have greater capacity in learning diverse concept relationships while dynamically allocating reasoning tasks to different sub experts, thereby significantly enhancing the model's adaptability to complex concept reasoning. MoE-SGT achieves higher accuracy than other concept bottleneck networks on multiple datasets by modeling structured relationships among concepts and utilizing a dynamic expert selection mechanism.
CVOct 1, 2025
Adaptive Event Stream Slicing for Open-Vocabulary Event-Based Object Detection via Vision-Language Knowledge DistillationJinchang Zhang, Zijun Li, Jiakai Lin et al.
Event cameras offer advantages in object detection tasks due to high-speed response, low latency, and robustness to motion blur. However, event cameras lack texture and color information, making open-vocabulary detection particularly challenging. Current event-based detection methods are typically trained on predefined categories, limiting their ability to generalize to novel objects, where encountering previously unseen objects is common. Vision-language models (VLMs) have enabled open-vocabulary object detection in RGB images. However, the modality gap between images and event streams makes it ineffective to directly transfer CLIP to event data, as CLIP was not designed for event streams. To bridge this gap, we propose an event-image knowledge distillation framework that leverages CLIP's semantic understanding to achieve open-vocabulary object detection on event data. Instead of training CLIP directly on event streams, we use image frames as inputs to a teacher model, guiding the event-based student model to learn CLIP's rich visual representations. Through spatial attention-based distillation, the student network learns meaningful visual features directly from raw event inputs while inheriting CLIP's broad visual knowledge. Furthermore, to prevent information loss due to event data segmentation, we design a hybrid spiking neural network (SNN) and convolutional neural network (CNN) framework. Unlike fixed-group event segmentation methods, which often discard crucial temporal information, our SNN adaptively determines the optimal event segmentation moments, ensuring that key temporal features are extracted. The extracted event features are then processed by CNNs for object detection.
CVAug 11, 2025
3D Plant Root Skeleton Detection and ExtractionJiakai Lin, Jinchang Zhang, Ge Jin et al.
Plant roots typically exhibit a highly complex and dense architecture, incorporating numerous slender lateral roots and branches, which significantly hinders the precise capture and modeling of the entire root system. Additionally, roots often lack sufficient texture and color information, making it difficult to identify and track root traits using visual methods. Previous research on roots has been largely confined to 2D studies; however, exploring the 3D architecture of roots is crucial in botany. Since roots grow in real 3D space, 3D phenotypic information is more critical for studying genetic traits and their impact on root development. We have introduced a 3D root skeleton extraction method that efficiently derives the 3D architecture of plant roots from a few images. This method includes the detection and matching of lateral roots, triangulation to extract the skeletal structure of lateral roots, and the integration of lateral and primary roots. We developed a highly complex root dataset and tested our method on it. The extracted 3D root skeletons showed considerable similarity to the ground truth, validating the effectiveness of the model. This method can play a significant role in automated breeding robots. Through precise 3D root structure analysis, breeding robots can better identify plant phenotypic traits, especially root structure and growth patterns, helping practitioners select seeds with superior root systems. This automated approach not only improves breeding efficiency but also reduces manual intervention, making the breeding process more intelligent and efficient, thus advancing modern agriculture.
CVMay 17, 2023
SAM for Poultry ScienceXiao Yang, Haixing Dai, Zihao Wu et al.
In recent years, the agricultural industry has witnessed significant advancements in artificial intelligence (AI), particularly with the development of large-scale foundational models. Among these foundation models, the Segment Anything Model (SAM), introduced by Meta AI Research, stands out as a groundbreaking solution for object segmentation tasks. While SAM has shown success in various agricultural applications, its potential in the poultry industry, specifically in the context of cage-free hens, remains relatively unexplored. This study aims to assess the zero-shot segmentation performance of SAM on representative chicken segmentation tasks, including part-based segmentation and the use of infrared thermal images, and to explore chicken-tracking tasks by using SAM as a segmentation tool. The results demonstrate SAM's superior performance compared to SegFormer and SETR in both whole and part-based chicken segmentation. SAM-based object tracking also provides valuable data on the behavior and movement patterns of broiler birds. The findings of this study contribute to a better understanding of SAM's potential in poultry science and lay the foundation for future advancements in chicken segmentation and tracking.
CVMar 15, 2021
Fast and Accurate: Video Enhancement using Sparse DepthYu Feng, Patrick Hansen, Paul N. Whatmough et al.
This paper presents a general framework to build fast and accurate algorithms for video enhancement tasks such as super-resolution, deblurring, and denoising. Essential to our framework is the realization that the accuracy, rather than the density, of pixel flows is what is required for high-quality video enhancement. Most of prior works take the opposite approach: they estimate dense (per-pixel)-but generally less robust-flows, mostly using computationally costly algorithms. Instead, we propose a lightweight flow estimation algorithm; it fuses the sparse point cloud data and (even sparser and less reliable) IMU data available in modern autonomous agents to estimate the flow information. Building on top of the flow estimation, we demonstrate a general framework that integrates the flows in a plug-and-play fashion with different task-specific layers. Algorithms built in our framework achieve 1.78x - 187.41x speedup while providing a 0.42 dB - 6.70 dB quality improvement over competing methods.