Multi-Modal Instruction-Tuning Small-Scale Language-and-Vision Assistant for Semiconductor Electron Micrograph Analysis
This work addresses the challenge of adopting proprietary models in semiconductor manufacturing by offering a secure, cost-effective, and customizable approach to microscopy image analysis. Its contribution is incremental, since it builds on existing multimodal methods.
The paper addresses the analysis of electron microscopy images in semiconductor manufacturing with a vision-language instruction-tuning framework. A teacher-student approach uses pre-trained models such as GPT-4 to generate instruction-following data for zero-shot visual question answering and classification tasks, yielding a customized assistant that reduces the need for extensive human labeling.
We present a novel framework for analyzing and interpreting electron microscopy images in semiconductor manufacturing using vision-language instruction tuning. The framework employs a teacher-student approach: pre-trained multimodal large language models such as GPT-4 generate instruction-following data for zero-shot visual question answering (VQA) and classification tasks, and this data is used to customize smaller multimodal models (SMMs) for microscopy image analysis. The result is an instruction-tuned language-and-vision assistant. Our framework merges knowledge engineering with machine learning to transfer domain-specific expertise from larger to smaller multimodal models within this specialized field, greatly reducing the need for extensive human labeling. Our study presents a secure, cost-effective, and customizable approach for analyzing microscopy images, addressing the challenges of adopting proprietary models in semiconductor manufacturing.
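The teacher-student data generation described above can be illustrated with a minimal sketch. It is not the authors' implementation: the record layout, the function names (`build_instruction_data`, `to_jsonl`), and the `stub_teacher` placeholder are all assumptions made for illustration. In practice the teacher would be a call to a large multimodal model, which is deliberately left out here.

```python
import json
from dataclasses import dataclass, asdict
from typing import Callable, List


@dataclass
class InstructionRecord:
    """One instruction-following example: an image, a question, a teacher answer."""
    image_path: str
    instruction: str
    response: str


# Hypothetical teacher signature: (image_path, question) -> answer text.
Teacher = Callable[[str, str], str]


def build_instruction_data(
    image_paths: List[str], questions: List[str], teacher: Teacher
) -> List[InstructionRecord]:
    """Ask the teacher every question about every image and collect the answers."""
    records = []
    for path in image_paths:
        for question in questions:
            answer = teacher(path, question)
            records.append(InstructionRecord(path, question, answer))
    return records


def to_jsonl(records: List[InstructionRecord]) -> str:
    """Serialize records as JSON Lines, a common format for fine-tuning data."""
    return "\n".join(json.dumps(asdict(r)) for r in records)


def stub_teacher(image_path: str, question: str) -> str:
    """Placeholder for the large teacher model (assumption: no real API is called)."""
    return f"[teacher answer for {image_path}: {question}]"
```

Calling `build_instruction_data(["a.png"], ["What structure is shown?"], stub_teacher)` yields one record per image-question pair. The resulting JSON Lines text could then serve as training data for a smaller student model, so human annotators are not needed to write the answers themselves.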