Manuel Knott

CV
h-index45
7papers
188citations
Novelty51%
AI Score39

7 Papers

CVJul 10, 2022
Facilitated machine learning for image-based fruit quality assessment

Manuel Knott, Fernando Perez-Cruz, Thijs Defraeye

Image-based machine learning models can be used to make the sorting and grading of agricultural products more efficient. In many regions, implementing such systems can be difficult due to the lack of centralization and automation of postharvest supply chains. Stakeholders are often too small to specialize in machine learning, and large training data sets are unavailable. We propose a machine learning procedure for images based on pre-trained Vision Transformers. It is easier to implement than the current standard approach of training Convolutional Neural Networks (CNNs) as we do not (re-)train deep neural networks. We evaluate our approach based on two data sets for apple defect detection and banana ripeness estimation. Our model achieves a competitive classification accuracy equal to or less than one percent below the best-performing CNN. At the same time, it requires three times fewer training samples to achieve a 90% accuracy.

CVSep 29, 2023Code
Text-image Alignment for Diffusion-based Perception

Neehar Kondapaneni, Markus Marks, Manuel Knott et al.

Diffusion models are generative models with impressive text-to-image synthesis capabilities and have spurred a new wave of creative methods for classical machine learning tasks. However, the best way to harness the perceptual knowledge of these generative models for visual tasks is still an open question. Specifically, it is unclear how to use the prompting interface when applying diffusion backbones to vision tasks. We find that automatically generated captions can improve text-image alignment and significantly enhance a model's cross-attention maps, leading to better perceptual performance. Our approach improves upon the current state-of-the-art (SOTA) in diffusion-based semantic segmentation on ADE20K and the current overall SOTA for depth estimation on NYUv2. Furthermore, our method generalizes to the cross-domain setting. We use model personalization and caption modifications to align our model to the target domain and find improvements over unaligned baselines. Our cross-domain object detection model, trained on Pascal VOC, achieves SOTA results on Watercolor2K. Our cross-domain segmentation method, trained on Cityscapes, achieves SOTA results on Dark Zurich-val and Nighttime Driving. Project page: https://www.vision.caltech.edu/tadp/. Code: https://github.com/damaggu/TADP.

CVJul 16, 2024
A Closer Look at Benchmarking Self-Supervised Pre-training with Image Classification

Markus Marks, Manuel Knott, Neehar Kondapaneni et al.

Self-supervised learning (SSL) is a machine learning approach where the data itself provides supervision, eliminating the need for external labels. The model is forced to learn about the data structure or context by solving a pretext task. With SSL, models can learn from abundant and cheap unlabeled data, significantly reducing the cost of training models where labels are expensive or inaccessible. In Computer Vision, SSL is widely used as pre-training followed by a downstream task, such as supervised transfer, few-shot learning on smaller labeled data sets, and/or unsupervised clustering. Unfortunately, it is infeasible to evaluate SSL methods on all possible downstream tasks and objectively measure the quality of the learned representation. Instead, SSL methods are evaluated using in-domain evaluation protocols, such as fine-tuning, linear probing, and k-nearest neighbors (kNN). However, it is not well understood how well these evaluation protocols estimate the representation quality of a pre-trained model for different downstream tasks under different conditions, such as dataset, metric, and model architecture. We study how classification-based evaluation protocols for SSL correlate and how well they predict downstream performance on different dataset types. Our study includes eleven common image datasets and 26 models that were pre-trained with different SSL methods or have different model backbones. We find that in-domain linear/kNN probing protocols are, on average, the best general predictors for out-of-domain performance. We further investigate the importance of batch normalization and evaluate how robust correlations are for different kinds of dataset domain shifts. We challenge assumptions about the relationship between discriminative and generative self-supervised methods, finding that most of their performance differences can be explained by changes to model backbones.

CVAug 26, 2024Code
Social Perception of Faces in a Vision-Language Model

Carina I. Hausladen, Manuel Knott, Colin F. Camerer et al.

We explore social perception of human faces in CLIP, a widely used open-source vision-language model. To this end, we compare the similarity in CLIP embeddings between different textual prompts and a set of face images. Our textual prompts are constructed from well-validated social psychology terms denoting social perception. The face images are synthetic and are systematically and independently varied along six dimensions: the legally protected attributes of age, gender, and race, as well as facial expression, lighting, and pose. Independently and systematically manipulating face attributes allows us to study the effect of each on social perception and avoids confounds that can occur in wild-collected data due to uncontrolled systematic correlations between attributes. Thus, our findings are experimental rather than observational. Our main findings are three. First, while CLIP is trained on the widest variety of images and texts, it is able to make fine-grained human-like social judgments on face images. Second, age, gender, and race do systematically impact CLIP's social perception of faces, suggesting an undesirable bias in CLIP vis-a-vis legally protected attributes. Most strikingly, we find a strong pattern of bias concerning the faces of Black women, where CLIP produces extreme values of social perception across different ages and facial expressions. Third, facial expression impacts social perception more than age and lighting as much as age. The last finding predicts that studies that do not control for unprotected visual attributes may reach the wrong conclusions on bias. Our novel method of investigation, which is founded on the social psychology literature and on the experiments involving the manipulation of individual attributes, yields sharper and more reliable observations than previous observational methods and may be applied to study biases in any vision-language model.

CVFeb 20, 2025Code
A Rapid Test for Accuracy and Bias of Face Recognition Technology

Manuel Knott, Ignacio Serna, Ethan Mann et al.

Measuring the accuracy of face recognition (FR) systems is essential for improving performance and ensuring responsible use. Accuracy is typically estimated using large annotated datasets, which are costly and difficult to obtain. We propose a novel method for 1:1 face verification that benchmarks FR systems quickly and without manual annotation, starting from approximate labels (e.g., from web search results). Unlike previous methods for training set label cleaning, ours leverages the embedding representation of the models being evaluated, achieving high accuracy in smaller-sized test datasets. Our approach reliably estimates FR accuracy and ranking, significantly reducing the time and cost of manual labeling. We also introduce the first public benchmark of five FR cloud services, revealing demographic biases, particularly lower accuracy for Asian women. Our rapid test method can democratize FR testing, promoting scrutiny and responsible use of the technology. Our method is provided as a publicly accessible tool at https://github.com/caltechvisionlab/frt-rapid-test

CVNov 25, 2024Code
Weakly Supervised Panoptic Segmentation for Defect-Based Grading of Fresh Produce

Manuel Knott, Divinefavour Odion, Sameer Sontakke et al.

Visual inspection for defect grading in agricultural supply chains is crucial but traditionally labor-intensive and error-prone. Automated computer vision methods typically require extensively annotated datasets, which are often unavailable in decentralized supply chains. We address this challenge by evaluating the Segment Anything Model (SAM) to generate dense panoptic segmentation masks from sparse annotations. These dense predictions are then used to train a supervised panoptic segmentation model. Focusing on banana surface defects (bruises and scars), we validate our approach using 476 field images annotated with 1440 defects. While SAM-generated masks generally align with human annotations, substantially reducing annotation effort, we explicitly identify failure cases associated with specific defect sizes and shapes. Despite these limitations, our approach offers practical estimates of defect number and relative size from panoptic masks, underscoring the potential and current boundaries of foundation models for defect quantification in low-data agricultural scenarios. GitHub: https://github.com/manuelknott/banana-defect-segmentation

LGJan 14, 2025
Gandalf the Red: Adaptive Security for LLMs

Niklas Pfister, Václav Volhejn, Manuel Knott et al.

Current evaluations of defenses against prompt attacks in large language model (LLM) applications often overlook two critical factors: the dynamic nature of adversarial behavior and the usability penalties imposed on legitimate users by restrictive defenses. We propose D-SEC (Dynamic Security Utility Threat Model), which explicitly separates attackers from legitimate users, models multi-step interactions, and expresses the security-utility in an optimizable form. We further address the shortcomings in existing evaluations by introducing Gandalf, a crowd-sourced, gamified red-teaming platform designed to generate realistic, adaptive attack. Using Gandalf, we collect and release a dataset of 279k prompt attacks. Complemented by benign user data, our analysis reveals the interplay between security and utility, showing that defenses integrated in the LLM (e.g., system prompts) can degrade usability even without blocking requests. We demonstrate that restricted application domains, defense-in-depth, and adaptive defenses are effective strategies for building secure and useful LLM applications.