CV | LG | Feb 9, 2025

Exploring Visual Embedding Spaces Induced by Vision Transformers for Online Auto Parts Marketplaces

arXiv:2502.05756v1 | h-index: 13
Originality: Synthesis-oriented
AI Analysis

This work addresses the detection of fraudulent or illegal activity in online auto parts marketplaces. Its contribution is incremental: it applies existing ViT methods without major methodological innovation.

This study evaluated Vision Transformers (ViT) for generating visual embeddings of auto part images from online marketplaces, with the aim of detecting illicit activity. It found that ViT is effective at isolating visual patterns, but that overlapping clusters and outliers remain a challenge.

This study examines the capabilities of the Vision Transformer (ViT) model in generating visual embeddings for images of auto parts sourced from online marketplaces, such as Craigslist and OfferUp. By focusing exclusively on single-modality data, the analysis evaluates ViT's potential for detecting patterns indicative of illicit activities. The workflow involves extracting high-dimensional embeddings from images, applying dimensionality reduction techniques like Uniform Manifold Approximation and Projection (UMAP) to visualize the embedding space, and using K-Means clustering to categorize similar items. Representative posts nearest to each cluster centroid provide insights into the composition and characteristics of the clusters. While the results highlight the strengths of ViT in isolating visual patterns, challenges such as overlapping clusters and outliers underscore the limitations of single-modal approaches in this domain. This work contributes to understanding the role of Vision Transformers in analyzing online marketplaces and offers a foundation for future advancements in detecting fraudulent or illegal activities.
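The clustering and "representative posts" steps described above can be illustrated with a minimal sketch. Everything here is an assumption made for illustration, not the authors' code: the embeddings are synthetic stand-ins for ViT outputs, the 8-dimensional size and `k=3` are arbitrary, and K-Means is written as a plain NumPy Lloyd's loop to keep the example dependency-light. UMAP is omitted because the abstract describes it as a visualization step, and the abstract does not say whether clustering runs on the raw or the reduced embeddings, so the sketch simply clusters whatever matrix it is given.

```python
import numpy as np


def assign(X, centroids):
    """Label each row of X with the index of its nearest centroid."""
    # Squared Euclidean distance from every point to every centroid.
    dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return dists.argmin(axis=1)


def kmeans(X, k, n_iter=50, seed=0):
    """Plain Lloyd's k-means (illustrative); returns (labels, centroids)."""
    rng = np.random.default_rng(seed)
    # Initialise centroids from k distinct data points.
    centroids = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(n_iter):
        labels = assign(X, centroids)
        new_centroids = centroids.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members):  # keep the old centroid if a cluster empties
                new_centroids[j] = members.mean(axis=0)
        converged = np.allclose(new_centroids, centroids)
        centroids = new_centroids
        if converged:
            break
    # Recompute labels so they always match the returned centroids.
    return assign(X, centroids), centroids


def representatives(X, labels, centroids, top_n=3):
    """For each cluster, indices of the top_n members closest to its centroid."""
    reps = {}
    for j, c in enumerate(centroids):
        idx = np.flatnonzero(labels == j)
        d = np.linalg.norm(X[idx] - c, axis=1)
        reps[j] = idx[np.argsort(d)[:top_n]].tolist()
    return reps


# Hypothetical data: 60 synthetic "image embeddings" drawn around 3 centres.
rng = np.random.default_rng(42)
centers = np.array([[0.0] * 8, [10.0] * 8, [-10.0] * 8])
X = np.vstack([c + rng.normal(scale=0.5, size=(20, 8)) for c in centers])

labels, centroids = kmeans(X, k=3)
reps = representatives(X, labels, centroids, top_n=3)
for cluster_id, post_indices in reps.items():
    print(f"cluster {cluster_id}: representative posts {post_indices}")
```

In a real pipeline, each returned index would map back to a marketplace post, so inspecting the few posts nearest a centroid gives a quick read on what that cluster contains. Posts far from every centroid are candidates for the outliers the study mentions.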

