CVFeb 24Code
SynthRender and IRIS: Open-Source Framework and Dataset for Bidirectional Sim-Real Transfer in Industrial Object PerceptionJose Moises Araya-Martinez, Thushar Tom, Adrián Sanchis Reig et al.
Object perception is fundamental for tasks such as robotic material handling and quality inspection. However, modern supervised deep-learning perception models require large datasets for robust automation under semi-uncontrolled conditions. The cost of acquiring and annotating such data for proprietary parts is a major barrier for widespread deployment. In this context, we release SynthRender, an open source framework for synthetic image generation with Guided Domain Randomization capabilities. Furthermore, we benchmark recent Reality-to-Simulation techniques for 3D asset creation from 2D images of real parts. Combined with Domain Randomization, these synthetic assets provide low-overhead, transferable data even for parts lacking 3D files. We also introduce IRIS, the Industrial Real-Sim Imagery Set, containing 32 categories with diverse textures, intra-class variation, strong inter-class similarities and about 20,000 labels. Ablations on multiple benchmarks outline guidelines for efficient data generation with SynthRender. Our method surpasses existing approaches, achieving 99.1% mAP@50 on a public robotics dataset, 98.3% mAP@50 on an automotive benchmark, and 95.3% mAP@50 on IRIS.
CVNov 24, 2025Code
IndEgo: A Dataset of Industrial Scenarios and Collaborative Work for Egocentric AssistantsVivek Chavan, Yasmina Imgrund, Tung Dao et al.
We introduce IndEgo, a multimodal egocentric and exocentric dataset addressing common industrial tasks, including assembly/disassembly, logistics and organisation, inspection and repair, woodworking, and others. The dataset contains 3,460 egocentric recordings (approximately 197 hours), along with 1,092 exocentric recordings (approximately 97 hours). A key focus of the dataset is collaborative work, where two workers jointly perform cognitively and physically intensive tasks. The egocentric recordings include rich multimodal data and added context via eye gaze, narration, sound, motion, and others. We provide detailed annotations (actions, summaries, mistake annotations, narrations), metadata, processed outputs (eye gaze, hand pose, semi-dense point cloud), and benchmarks on procedural and non-procedural task understanding, Mistake Detection, and reasoning-based Question Answering. Baseline evaluations for Mistake Detection, Question Answering and collaborative task understanding show that the dataset presents a challenge for the state-of-the-art multimodal models. Our dataset is available at: https://huggingface.co/datasets/FraunhoferIPK/IndEgo
CVJun 5, 2025Code
Physical Annotation for Automated Optical Inspection: A Concept for In-Situ, Pointer-Based Training Data GenerationOliver Krumpek, Oliver Heimann, Jörg Krüger
This paper introduces a novel physical annotation system designed to generate training data for automated optical inspection. The system uses pointer-based in-situ interaction to transfer the valuable expertise of trained inspection personnel directly into a machine learning (ML) training pipeline. Unlike conventional screen-based annotation methods, our system captures physical trajectories and contours directly on the object, providing a more intuitive and efficient way to label data. The core technology uses calibrated, tracked pointers to accurately record user input and transform these spatial interactions into standardised annotation formats that are compatible with open-source annotation software. Additionally, a simple projector-based interface projects visual guidance onto the object to assist users during the annotation process, ensuring greater accuracy and consistency. The proposed concept bridges the gap between human expertise and automated data generation, enabling non-IT experts to contribute to the ML training pipeline and preventing the loss of valuable training samples. Preliminary evaluation results confirm the feasibility of capturing detailed annotation trajectories and demonstrate that integration with CVAT streamlines the workflow for subsequent ML tasks. This paper details the system architecture, calibration procedures and interface design, and discusses its potential contribution to future ML data generation for automated optical inspection.
CVJun 11, 2024Code
On the Application of Egocentric Computer Vision to Industrial ScenariosVivek Chavan, Oliver Heimann, Jörg Krüger
Egocentric vision aims to capture and analyse the world from the first-person perspective. We explore the possibilities for egocentric wearable devices to improve and enhance industrial use cases w.r.t. data collection, annotation, labelling and downstream applications. This would contribute to easier data collection and allow users to provide additional context. We envision that this approach could serve as a supplement to the traditional industrial Machine Vision workflow. Code, Dataset and related resources will be available at: https://github.com/Vivek9Chavan/EgoVis24
HCJun 14, 2025
Feeling Machines: Ethics, Culture, and the Rise of Emotional AIVivek Chavan, Arsen Cenaj, Shuyuan Shen et al.
This paper explores the growing presence of emotionally responsive artificial intelligence through a critical and interdisciplinary lens. Bringing together the voices of early-career researchers from multiple fields, it explores how AI systems that simulate or interpret human emotions are reshaping our interactions in areas such as education, healthcare, mental health, caregiving, and digital life. The analysis is structured around four central themes: the ethical implications of emotional AI, the cultural dynamics of human-machine interaction, the risks and opportunities for vulnerable populations, and the emerging regulatory, design, and technical considerations. The authors highlight the potential of affective AI to support mental well-being, enhance learning, and reduce loneliness, as well as the risks of emotional manipulation, over-reliance, misrepresentation, and cultural bias. Key challenges include simulating empathy without genuine understanding, encoding dominant sociocultural norms into AI systems, and insufficient safeguards for individuals in sensitive or high-risk contexts. Special attention is given to children, elderly users, and individuals with mental health challenges, who may interact with AI in emotionally significant ways. However, there remains a lack of cognitive or legal protections which are necessary to navigate such engagements safely. The report concludes with ten recommendations, including the need for transparency, certification frameworks, region-specific fine-tuning, human oversight, and longitudinal research. A curated supplementary section provides practical tools, models, and datasets to support further work in this domain.
CVNov 28, 2025
Synthetic Industrial Object Detection: GenAI vs. Feature-Based MethodsJose Moises Araya-Martinez, Adrián Sanchis Reig, Gautham Mohan et al.
Reducing the burden of data generation and annotation remains a major challenge for the cost-effective deployment of machine learning in industrial and robotics settings. While synthetic rendering is a promising solution, bridging the sim-to-real gap often requires expert intervention. In this work, we benchmark a range of domain randomization (DR) and domain adaptation (DA) techniques, including feature-based methods, generative AI (GenAI), and classical rendering approaches, for creating contextualized synthetic data without manual annotation. Our evaluation focuses on the effectiveness and efficiency of low-level and high-level feature alignment, as well as a controlled diffusion-based DA method guided by prompts generated from real-world contexts. We validate our methods on two datasets: a proprietary industrial dataset (automotive and logistics) and a public robotics dataset. Results show that if render-based data with enough variability is available as seed, simpler feature-based methods, such as brightness-based and perceptual hashing filtering, outperform more complex GenAI-based approaches in both accuracy and resource efficiency. Perceptual hashing consistently achieves the highest performance, with mAP50 scores of 98% and 67% on the industrial and robotics datasets, respectively. Additionally, GenAI methods present significant time overhead for data generation at no apparent improvement of sim-to-real mAP values compared to simpler methods. Our findings offer actionable insights for efficiently bridging the sim-to-real gap, enabling high real-world performance from models trained exclusively on synthetic data.
CVNov 28, 2025
Zero-Shot Multi-Criteria Visual Quality Inspection for Semi-Controlled Industrial Environments via Real-Time 3D Digital Twin SimulationJose Moises Araya-Martinez, Gautham Mohan, Kenichi Hayakawa Bolaños et al.
Early-stage visual quality inspection is vital for achieving Zero-Defect Manufacturing and minimizing production waste in modern industrial environments. However, the complexity of robust visual inspection systems and their extensive data requirements hinder widespread adoption in semi-controlled industrial settings. In this context, we propose a pose-agnostic, zero-shot quality inspection framework that compares real scenes against real-time Digital Twins (DT) in the RGB-D space. Our approach enables efficient real-time DT rendering by semantically describing industrial scenes through object detection and pose estimation of known Computer-Aided Design models. We benchmark tools for real-time, multimodal RGB-D DT creation while tracking consumption of computational resources. Additionally, we provide an extensible and hierarchical annotation strategy for multi-criteria defect detection, unifying pose labelling with logical and structural defect annotations. Based on an automotive use case featuring the quality inspection of an axial flux motor, we demonstrate the effectiveness of our framework. Our results demonstrate detection performace, achieving intersection-over-union (IoU) scores of up to 63.3% compared to ground-truth masks, even if using simple distance measurements under semi-controlled industrial conditions. Our findings lay the groundwork for future research on generalizable, low-data defect detection methods in dynamic manufacturing settings.
CVMar 25, 2025
Vanishing Depth: A Depth Adapter with Positional Depth Encoding for Generalized Image EncodersPaul Koch, Jörg Krüger, Ankit Chowdhury et al.
Generalized metric depth understanding is critical for precise vision-guided robotics, which current state-of-the-art (SOTA) vision-encoders do not support. To address this, we propose Vanishing Depth, a self-supervised training approach that extends pretrained RGB encoders to incorporate and align metric depth into their feature embeddings. Based on our novel positional depth encoding, we enable stable depth density and depth distribution invariant feature extraction. We achieve performance improvements and SOTA results across a spectrum of relevant RGBD downstream tasks - without the necessity of finetuning the encoder. Most notably, we achieve 56.05 mIoU on SUN-RGBD segmentation, 88.3 RMSE on Void's depth completion, and 83.8 Top 1 accuracy on NYUv2 scene classification. In 6D-object pose estimation, we outperform our predecessors of DinoV2, EVA-02, and Omnivore and achieve SOTA results for non-finetuned encoders in several related RGBD downstream tasks.
CVFeb 21, 2025
MVIP -- A Dataset and Methods for Application Oriented Multi-View and Multi-Modal Industrial Part RecognitionPaul Koch, Marian Schlüter, Jörg Krüger
We present MVIP, a novel dataset for multi-modal and multi-view application-oriented industrial part recognition. Here we are the first to combine a calibrated RGBD multi-view dataset with additional object context such as physical properties, natural language, and super-classes. The current portfolio of available datasets offers a wide range of representations to design and benchmark related methods. In contrast to existing classification challenges, industrial recognition applications offer controlled multi-modal environments but at the same time have different problems than traditional 2D/3D classification challenges. Frequently, industrial applications must deal with a small amount or increased number of training data, visually similar parts, and varying object sizes, while requiring a robust near 100% top 5 accuracy under cost and time constraints. Current methods tackle such challenges individually, but direct adoption of these methods within industrial applications is complex and requires further research. Our main goal with MVIP is to study and push transferability of various state-of-the-art methods within related downstream tasks towards an efficient deployment of industrial classifiers. Additionally, we intend to push with MVIP research regarding several modality fusion topics, (automated) synthetic data generation, and complex data sampling -- combined in a single application-oriented benchmark.
ROFeb 2, 2022
Towards High-Payload Admittance Control for Manual Guidance with Environmental ContactKevin Haninger, Marcel Radke, Axel Vick et al.
Force control enables hands-on teaching and physical collaboration, with the potential to improve ergonomics and flexibility of automation. Established methods for the design of compliance, impedance control, and \rev{collision response} can achieve free-space stability and acceptable peak contact force on lightweight, lower payload robots. Scaling collaboration to higher payloads can allow new applications, but introduces challenges due to the more significant payload dynamics and the use of higher-payload industrial robots. To achieve high-payload manual guidance with contact, this paper proposes and validates new mechatronic design methods: standard admittance control is extended with damping feedback, compliant structures are integrated to the environment, and a contact response method which allows continuous admittance control is proposed. These methods are compared with respect to free-space stability, contact stability, and peak contact force. The resulting methods are then applied to realize two contact-rich tasks on a 16 kg payload (peg in hole and slot assembly) and free-space co-manipulation of a 50 kg payload.
ROOct 24, 2021
Contact Information Flow and Design of ComplianceKevin Haninger, Marcel Radke, Richard Hartisch et al.
Identifying changes in contact during contact-rich manipulation can detect task state or errors, enabling improved robustness and autonomy. The ability to detect contact is affected by the mechatronic design of the robot, especially its physical compliance. Established methods can design physical compliance for many aspects of contact performance (e.g. peak contact force, motion/force control bandwidth), but are based on time-invariant dynamic models. A change in contact mode is a discrete change in coupled robot-environment dynamics, not easily considered in existing design methods. Towards designing robots which can robustly detect changes in contact mode online, this paper investigates how mechatronic design can improve contact estimation, with a focus on the impact of the location and degree of compliance. A design metric of information gain is proposed which measures how much position/force measurements reduce uncertainty in the contact mode estimate. This information gain is developed for fully- and partially-observed systems, as partial observability can arise from joint flexibility in the robot or environmental inertia. Hardware experiments with various compliant setups validate that information gain predicts the speed and certainty with which contact is detected in (i) monitoring of contact-rich assembly and (ii) collision detection.
CVJul 9, 2020
Uncertainty Quantification in Deep Residual Neural NetworksLukasz Wandzik, Raul Vicente Garcia, Jörg Krüger
Uncertainty quantification is an important and challenging problem in deep learning. Previous methods rely on dropout layers which are not present in modern deep architectures or batch normalization which is sensitive to batch sizes. In this work, we address the problem of uncertainty quantification in deep residual networks by using a regularization technique called stochastic depth. We show that training residual networks using stochastic depth can be interpreted as a variational approximation to the intractable posterior over the weights in Bayesian neural networks. We demonstrate that by sampling from a distribution of residual networks with varying depth and shared weights, meaningful uncertainty estimates can be obtained. Moreover, compared to the original formulation of residual networks, our method produces well-calibrated softmax probabilities with only minor changes to the network's structure. We evaluate our approach on popular computer vision datasets and measure the quality of uncertainty estimates. We also test the robustness to domain shift and show that our method is able to express higher predictive uncertainty on out-of-distribution samples. Finally, we demonstrate how the proposed approach could be used to obtain uncertainty estimates in facial verification applications.