CLAug 25, 2024Code
DHP Benchmark: Are LLMs Good NLG Evaluators?Yicheng Wang, Jiayi Yuan, Yu-Neng Chuang et al. · amazon-science, microsoft-research
Large Language Models (LLMs) are increasingly serving as evaluators in Natural Language Generation (NLG) tasks; this is often referred to as ``LLM-as-a-judge'' paradigm. However, the capabilities of LLMs in evaluating NLG quality remain underexplored. Current studies depend on human assessments and simple metrics that fail to capture the discernment of LLMs across diverse NLG tasks. To address this gap, we propose the Discernment of Hierarchical Perturbation (DHP) benchmarking framework, which provides quantitative discernment scores for LLMs. This framework leverages hierarchically perturbed text data and statistical tests to systematically measure the NLG evaluation capabilities of LLMs. We re-established six evaluation datasets for this benchmark, covering four NLG tasks: Summarization, Story Completion, Question Answering, and Translation. Our comprehensive benchmarking of five major LLM families provides critical insight into their strengths and limitations as NLG evaluators. Our dataset is available at https://huggingface.co/datasets/YCWANGVINCE/DHP_Benchmark.
LGFeb 1
LiME: Lightweight Mixture of Experts for Efficient Multimodal Multi-task LearningMd Kowsher, Haris Mansoor, Nusrat Jahan Prottasha et al.
MoE-PEFT methods combine Mixture of Experts with parameter-efficient fine-tuning for multi-task adaptation, but require separate adapters per expert causing trainable parameters to scale linearly with expert count and limiting applicability to adapter-based architectures. We propose LiME (Lightweight Mixture of Experts), which achieves expert specialization through lightweight modulation rather than adapter replication. Instead of separate adapters, LiME uses a single shared PEFT module and modulates its output with lightweight expert vectors, reducing expert parameters while generalizing to any PEFT method. Notably, LiME introduces zero-parameter routing by leveraging existing frozen and adapted representations eliminating learned router parameters typically required per layer. Theoretically, we prove that (i) more experts preserve more task-relevant information and (ii) modulation approximates full expert-specific PEFT with bounded error. LiME further incorporates n-gram windowed routing and adaptive expert selection (Auto Top-K) based on routing confidence. Experiments on MMT-47, a multimodal multi-task benchmark with 47 tasks spanning text, image, and video, demonstrate that LiME achieves competitive or superior performance while using up to 4x fewer trainable parameters and up to 29% faster training compared to corresponding MoE-PEFT baselines.
ASMay 21, 2024Code
FairLENS: Assessing Fairness in Law Enforcement Speech RecognitionYicheng Wang, Mark Cusick, Mohamed Laila et al.
Automatic speech recognition (ASR) techniques have become powerful tools, enhancing efficiency in law enforcement scenarios. To ensure fairness for demographic groups in different acoustic environments, ASR engines must be tested across a variety of speakers in realistic settings. However, describing the fairness discrepancies between models with confidence remains a challenge. Meanwhile, most public ASR datasets are insufficient to perform a satisfying fairness evaluation. To address the limitations, we built FairLENS - a systematic fairness evaluation framework. We propose a novel and adaptable evaluation method to examine the fairness disparity between different models. We also collected a fairness evaluation dataset covering multiple scenarios and demographic dimensions. Leveraging this framework, we conducted fairness assessments on 1 open-source and 11 commercially available state-of-the-art ASR models. Our results reveal that certain models exhibit more biases than others, serving as a fairness guideline for users to make informed choices when selecting ASR models for a given real-world scenario. We further explored model biases towards specific demographic groups and observed that shifts in the acoustic domain can lead to the emergence of new biases.
CVApr 10
How Should Video LLMs Output Time? An Analysis of Efficient Temporal Grounding ParadigmsShengji Jin, Yuanhao Zou, Victor Zhu et al.
While Multimodal Large Language Models (MLLMs) have advanced Video Temporal Grounding (VTG), existing methods often couple output paradigms with different backbones, datasets, and training protocols. This makes it challenging to isolate the specific impact of the output design. Additionally, as VTG systems are increasingly considered for resource-constrained edge deployment, the trade-off between output formulation and system-level efficiency requires systematic investigation. In this paper, we present a controlled empirical study comparing three dominant VTG output paradigms: Text Numeral Generation, Temporal Token Generation, and Continuous Temporal Decoding. We evaluate these paradigms across identical compact VLMs (SmolVLM2, FastVLM, and Molmo2) using consistent datasets and LoRA fine-tuning protocols. Evaluations on Charades-STA, QVHighlights, and YouCook2 measure both localization accuracy and system efficiency, including inference latency, training throughput, and parameter overhead. Our results demonstrate that the choice of output formulation significantly affects both grounding accuracy and computational cost, independent of model scale. Specifically, the continuous distribution paradigm consistently achieves the most favorable efficiency-accuracy trade-off on the Pareto frontier, delivering robust localization with minimal latency overhead. These findings provide objective empirical guidelines for designing efficient, deployment-ready VTG systems.
CLFeb 11, 2025
Auto-Drafting Police Reports from Noisy ASR Outputs: A Trust-Centered LLM ApproachParam Kulkarni, Yingchi Liu, Hao-Ming Fu et al.
Achieving a delicate balance between fostering trust in law enforcement and protecting the rights of both officers and civilians continues to emerge as a pressing research and product challenge in the world today. In the pursuit of fairness and transparency, this study presents an innovative AI-driven system designed to generate police report drafts from complex, noisy, and multi-role dialogue data. Our approach intelligently extracts key elements of law enforcement interactions and includes them in the draft, producing structured narratives that are not only high in quality but also reinforce accountability and procedural clarity. This framework holds the potential to transform the reporting process, ensuring greater oversight, consistency, and fairness in future policing practices. A demonstration video of our system can be accessed at https://drive.google.com/file/d/1kBrsGGR8e3B5xPSblrchRGj-Y-kpCHNO/view?usp=sharing
CVDec 13, 2024
EVLM: Self-Reflective Multimodal Reasoning for Cross-Dimensional Visual EditingUmar Khalid, Kashif Munir, Hasan Iqbal et al.
Editing complex visual content from ambiguous or partially specified instructions remains a core challenge in vision-language modeling. Existing models can contextualize content but often fail to infer the underlying intent within a reference image or scene, leading to inconsistent or misaligned edits. We introduce the Editing Vision-Language Model (EVLM), a system that interprets ambiguous instructions in conjunction with reference visuals to produce precise, context-aware editing prompts. EVLM's key innovation is a reflective reasoning framework that translates subjective user intent into structured, actionable outputs by aligning with human-rated rationales through Reflection-Aware KL-Divergence Target Optimization (RKTO). By combining Chain-of-Thought (CoT) reasoning with RKTO alignment, EVLM captures fine-grained editing preferences without relying on binary supervision. Trained on a dataset of 30,000 CoT examples with human-annotated rationale quality, EVLM achieves substantial gains in alignment with human intent. Experiments across image, video, 3D, and 4D editing tasks show that EVLM generates coherent and high-quality instructions, providing a scalable foundation for multimodal editing and reasoning.