CVAug 13, 2024
Vision Language Model for Interpretable and Fine-grained Detection of Safety Compliance in Diverse WorkplacesZhiling Chen, Hanning Chen, Mohsen Imani et al.
Workplace accidents due to personal protective equipment (PPE) non-compliance raise serious safety concerns and lead to legal liabilities, financial penalties, and reputational damage. While object detection models have shown the capability to address this issue by identifying safety items, most existing models, such as YOLO, Faster R-CNN, and SSD, are limited in verifying the fine-grained attributes of PPE across diverse workplace scenarios. Vision language models (VLMs) are gaining traction for detection tasks by leveraging the synergy between visual and textual information, offering a promising solution to traditional object detection limitations in PPE recognition. Nonetheless, VLMs face challenges in consistently verifying PPE attributes due to the complexity and variability of workplace environments, requiring them to interpret context-specific language and visual cues simultaneously. We introduce Clip2Safety, an interpretable detection framework for diverse workplace safety compliance, which comprises four main modules: scene recognition, the visual prompt, safety items detection, and fine-grained verification. The scene recognition identifies the current scenario to determine the necessary safety gear. The visual prompt formulates the specific visual prompts needed for the detection process. The safety items detection identifies whether the required safety gear is being worn according to the specified scenario. Lastly, the fine-grained verification assesses whether the worn safety equipment meets the fine-grained attribute requirements. We conduct real-world case studies across six different scenarios. The results show that Clip2Safety not only demonstrates an accuracy improvement over state-of-the-art question-answering based VLMs but also achieves inference times two hundred times faster.
64.4ROMay 5
Task-Aware Scanning Parameter Configuration for Robotic Inspection Using Vision Language Embeddings and Hyperdimensional ComputingZhiling Chen, David Gorsich, Matthew P. Castanier et al.
Robotic laser profiling is widely used for dimensional verification and surface inspection, yet measurement fidelity is often dominated by sensor configuration rather than robot motion. Industrial profilers expose multiple coupled parameters, including sampling frequency, measurement range, exposure time, receiver dynamic range, and illumination, that are still tuned by trial-and-error; mismatches can cause saturation, clipping, or missing returns that cannot be recovered downstream. We formulate instruction-conditioned sensing parameter recommendation; given a pre-scan RGB observation and a natural-language inspection instruction, infer a discrete configuration over key parameters of a robot-mounted profiler. To benchmark this problem, we develop Instruct-Obs2Param, a real-world multimodal dataset linking inspection intents and multi-view pose and illumination variation across 16 objects to canonical parameter regimes. We then propose ScanHD, a hyperdimensional computing framework that binds instruction and observation into a task-aware code and performs parameter-wise associative reasoning with compact memories, matching discrete scanner regimes while yielding stable, interpretable, low-latency decisions. On Instruct-Obs2Param, ScanHD achieves 92.7% average exact accuracy and 98.1% average Win@1 accuracy across the five parameters, with strong cross-split generalization and low-latency inference suitable for deployment, outperforming rule-based heuristics, conventional multimodal models, and multimodal large language models. This work enables autonomous, instruction-conditioned sensing configuration from task intent and scene context, eliminating manual tuning and elevating sensor configuration from a static setting to an adaptive decision variable.
CVJan 27, 2025
Can Multimodal Large Language Models be Guided to Improve Industrial Anomaly Detection?Zhiling Chen, Hanning Chen, Mohsen Imani et al.
In industrial settings, the accurate detection of anomalies is essential for maintaining product quality and ensuring operational safety. Traditional industrial anomaly detection (IAD) models often struggle with flexibility and adaptability, especially in dynamic production environments where new defect types and operational changes frequently arise. Recent advancements in Multimodal Large Language Models (MLLMs) hold promise for overcoming these limitations by combining visual and textual information processing capabilities. MLLMs excel in general visual understanding due to their training on large, diverse datasets, but they lack domain-specific knowledge, such as industry-specific defect tolerance levels, which limits their effectiveness in IAD tasks. To address these challenges, we propose Echo, a novel multi-expert framework designed to enhance MLLM performance for IAD. Echo integrates four expert modules: Reference Extractor which provides a contextual baseline by retrieving similar normal images, Knowledge Guide which supplies domain-specific insights, Reasoning Expert which enables structured, stepwise reasoning for complex queries, and Decision Maker which synthesizes information from all modules to deliver precise, context-aware responses. Evaluated on the MMAD benchmark, Echo demonstrates significant improvements in adaptability, precision, and robustness, moving closer to meeting the demands of real-world industrial anomaly detection.
LGNov 2, 2024
Privacy-Preserving Federated Learning with Differentially Private Hyperdimensional ComputingFardin Jalil Piran, Zhiling Chen, Mohsen Imani et al.
Federated Learning (FL) has become a key method for preserving data privacy in Internet of Things (IoT) environments, as it trains Machine Learning (ML) models locally while transmitting only model updates. Despite this design, FL remains susceptible to threats such as model inversion and membership inference attacks, which can reveal private training data. Differential Privacy (DP) techniques are often introduced to mitigate these risks, but simply injecting DP noise into black-box ML models can compromise accuracy, particularly in dynamic IoT contexts, where continuous, lifelong learning leads to excessive noise accumulation. To address this challenge, we propose Federated HyperDimensional computing with Privacy-preserving (FedHDPrivacy), an eXplainable Artificial Intelligence (XAI) framework that integrates neuro-symbolic computing and DP. Unlike conventional approaches, FedHDPrivacy actively monitors the cumulative noise across learning rounds and adds only the additional noise required to satisfy privacy constraints. In a real-world application for monitoring manufacturing machining processes, FedHDPrivacy maintains high performance while surpassing standard FL frameworks - Federated Averaging (FedAvg), Federated Proximal (FedProx), Federated Normalized Averaging (FedNova), and Federated Optimization (FedOpt) - by up to 37%. Looking ahead, FedHDPrivacy offers a promising avenue for further enhancements, such as incorporating multimodal data fusion.
CRSep 12, 2025
Privacy-Preserving Decentralized Federated Learning via Explainable Adaptive Differential PrivacyFardin Jalil Piran, Zhiling Chen, Yang Zhang et al.
Decentralized federated learning faces privacy risks because model updates can leak data through inference attacks and membership inference, a concern that grows over many client exchanges. Differential privacy offers principled protection by injecting calibrated noise so confidential information remains secure on resource-limited IoT devices. Yet without transparency, black-box training cannot track noise already injected by previous clients and rounds, which forces worst-case additions and harms accuracy. We propose PrivateDFL, an explainable framework that joins hyperdimensional computing with differential privacy and keeps an auditable account of cumulative noise so each client adds only the difference between the required noise and what has already been accumulated. We evaluate on MNIST, ISOLET, and UCI-HAR to span image, signal, and tabular modalities, and we benchmark against transformer-based and deep learning-based baselines trained centrally with Differentially Private Stochastic Gradient Descent (DP-SGD) and Renyi Differential Privacy (RDP). PrivateDFL delivers higher accuracy, lower latency, and lower energy across IID and non-IID partitions while preserving formal (epsilon, delta) guarantees and operating without a central server. For example, under non-IID partitions, PrivateDFL achieves 24.42% higher accuracy than the Vision Transformer on MNIST while using about 10x less training time, 76x lower inference latency, and 11x less energy, and on ISOLET it exceeds Transformer accuracy by more than 80% with roughly 10x less training time, 40x lower inference latency, and 36x less training energy. Future work will extend the explainable accounting to adversarial clients and adaptive topologies with heterogeneous privacy budgets.
AIMay 20, 2025
Multimodal RAG-driven Anomaly Detection and Classification in Laser Powder Bed Fusion using Large Language ModelsKiarash Naghavi Khanghah, Zhiling Chen, Lela Romeo et al.
Additive manufacturing enables the fabrication of complex designs while minimizing waste, but faces challenges related to defects and process anomalies. This study presents a novel multimodal Retrieval-Augmented Generation-based framework that automates anomaly detection across various Additive Manufacturing processes leveraging retrieved information from literature, including images and descriptive text, rather than training datasets. This framework integrates text and image retrieval from scientific literature and multimodal generation models to perform zero-shot anomaly identification, classification, and explanation generation in a Laser Powder Bed Fusion setting. The proposed framework is evaluated on four L-PBF manufacturing datasets from Oak Ridge National Laboratory, featuring various printer makes, models, and materials. This evaluation demonstrates the framework's adaptability and generalizability across diverse images without requiring additional training. Comparative analysis using Qwen2-VL-2B and GPT-4o-mini as MLLM within the proposed framework highlights that GPT-4o-mini outperforms Qwen2-VL-2B and proportional random baseline in manufacturing anomalies classification. Additionally, the evaluation of the RAG system confirms that incorporating retrieval mechanisms improves average accuracy by 12% by reducing the risk of hallucination and providing additional information. The proposed framework can be continuously updated by integrating emerging research, allowing seamless adaptation to the evolving landscape of AM technologies. This scalable, automated, and zero-shot-capable framework streamlines AM anomaly analysis, enhancing efficiency and accuracy.