IRJul 29, 2024
Graphite: A Graph-based Extreme Multi-Label Short Text Classifier for Keyphrase RecommendationAshirbad Mishra, Soumik Dey, Jinyu Zhao et al.
Keyphrase Recommendation has been a pivotal problem in advertising and e-commerce where advertisers/sellers are recommended keyphrases (search queries) to bid on to increase their sales. It is a challenging task due to the plethora of items shown on online platforms and various possible queries that users search while showing varying interest in the displayed items. Moreover, query/keyphrase recommendations need to be made in real-time and in a resource-constrained environment. This problem can be framed as an Extreme Multi-label (XML) Short text classification by tagging the input text with keywords as labels. Traditional neural network models are either infeasible or have slower inference latency due to large label spaces. We present Graphite, a graph-based classifier model that provides real-time keyphrase recommendations that are on par with standard text classification models. Furthermore, it doesn't utilize GPU resources, which can be limited in production environments. Due to its lightweight nature and smaller footprint, it can train on very large datasets, where state-of-the-art XML models fail due to extreme resource requirements. Graphite is deterministic, transparent, and intrinsically more interpretable than neural network-based models. We present a comprehensive analysis of our model's performance across forty categories spanning eBay's English-speaking sites.
IRSep 5, 2024
GraphEx: A Graph-based Extraction Method for Advertiser Keyphrase RecommendationAshirbad Mishra, Soumik Dey, Marshall Wu et al.
Online sellers and advertisers are recommended keyphrases for their listed products, which they bid on to enhance their sales. One popular paradigm that generates such recommendations is Extreme Multi-Label Classification (XMC), which involves tagging/mapping keyphrases to items. We outline the limitations of using traditional item-query based tagging or mapping techniques for keyphrase recommendations on E-Commerce platforms. We introduce GraphEx, an innovative graph-based approach that recommends keyphrases to sellers using extraction of token permutations from item titles. Additionally, we demonstrate that relying on traditional metrics such as precision/recall can be misleading in practical applications, thereby necessitating a combination of metrics to evaluate performance in real-world scenarios. These metrics are designed to assess the relevance of keyphrases to items and the potential for buyer outreach. GraphEx outperforms production models at eBay, achieving the objectives mentioned above. It supports near real-time inferencing in resource-constrained production environments and scales effectively for billions of items.
CVFeb 10
ConsID-Gen: View-Consistent and Identity-Preserving Image-to-Video GenerationMingyang Wu, Ashirbad Mishra, Soumik Dey et al.
Image-to-Video generation (I2V) animates a static image into a temporally coherent video sequence following textual instructions, yet preserving fine-grained object identity under changing viewpoints remains a persistent challenge. Unlike text-to-video models, existing I2V pipelines often suffer from appearance drift and geometric distortion, artifacts we attribute to the sparsity of single-view 2D observations and weak cross-modal alignment. Here we address this problem from both data and model perspectives. First, we curate ConsIDVid, a large-scale object-centric dataset built with a scalable pipeline for high-quality, temporally aligned videos, and establish ConsIDVid-Bench, where we present a novel benchmarking and evaluation framework for multi-view consistency using metrics sensitive to subtle geometric and appearance deviations. We further propose ConsID-Gen, a view-assisted I2V generation framework that augments the first frame with unposed auxiliary views and fuses semantic and structural cues via a dual-stream visual-geometric encoder as well as a text-visual connector, yielding unified conditioning for a Diffusion Transformer backbone. Experiments across ConsIDVid-Bench demonstrate that ConsID-Gen consistently outperforms in multiple metrics, with the best overall performance surpassing leading video generation models like Wan2.1 and HunyuanVideo, delivering superior identity fidelity and temporal coherence under challenging real-world scenarios. We will release our model and dataset at https://myangwu.github.io/ConsID-Gen.
IRMay 25, 2025
BroadGen: A Framework for Generating Effective and Efficient Advertiser Broad Match Keyphrase RecommendationsAshirbad Mishra, Jinyu Zhao, Soumik Dey et al.
In the domain of sponsored search advertising, the focus of Keyphrase recommendation has largely been on exact match types, which pose issues such as high management expenses, limited targeting scope, and evolving search query patterns. Alternatives like Broad match types can alleviate certain drawbacks of exact matches but present challenges like poor targeting accuracy and minimal supervisory signals owing to limited advertiser usage. This research defines the criteria for an ideal broad match, emphasizing on both efficiency and effectiveness, ensuring that a significant portion of matched queries are relevant. We propose BroadGen, an innovative framework that recommends efficient and effective broad match keyphrases by utilizing historical search query data. Additionally, we demonstrate that BroadGen, through token correspondence modeling, maintains better query stability over time. BroadGen's capabilities allow it to serve daily, millions of sellers at eBay with over 2.5 billion items.
CVOct 9, 2025
Q-Router: Agentic Video Quality Assessment with Expert Model Routing and Artifact LocalizationShuo Xing, Soumik Dey, Mingyang Wu et al.
Video quality assessment (VQA) is a fundamental computer vision task that aims to predict the perceptual quality of a given video in alignment with human judgments. Existing performant VQA models trained with direct score supervision suffer from (1) poor generalization across diverse content and tasks, ranging from user-generated content (UGC), short-form videos, to AI-generated content (AIGC), (2) limited interpretability, and (3) lack of extensibility to novel use cases or content types. We propose Q-Router, an agentic framework for universal VQA with a multi-tier model routing system. Q-Router integrates a diverse set of expert models and employs vision--language models (VLMs) as real-time routers that dynamically reason and then ensemble the most appropriate experts conditioned on the input video semantics. We build a multi-tiered routing system based on the computing budget, with the heaviest tier involving a specific spatiotemporal artifacts localization for interpretability. This agentic design enables Q-Router to combine the complementary strengths of specialized experts, achieving both flexibility and robustness in delivering consistent performance across heterogeneous video sources and tasks. Extensive experiments demonstrate that Q-Router matches or surpasses state-of-the-art VQA models on a variety of benchmarks, while substantially improving generalization and interpretability. Moreover, Q-Router excels on the quality-based question answering benchmark, Q-Bench-Video, highlighting its promise as a foundation for next-generation VQA systems. Finally, we show that Q-Router capably localizes spatiotemporal artifacts, showing potential as a reward function for post-training video generation models.