85.3LGApr 3
MixAtlas: Uncertainty-aware Data Mixture Optimization for Multimodal LLM MidtrainingBingbing Wen, Sirajul Salekin, Feiyang Kang et al.
Domain reweighting can improve sample efficiency and downstream generalization, but data-mixture optimization for multimodal midtraining remains largely unexplored. Current multimodal training recipes tune mixtures along a single dimension, typically data format or task type. We introduce MixAtlas, a method that produces benchmark-targeted data recipes that can be inspected, adapted, and transferred to new corpora. MixAtlas decomposes the training corpus along two axes: image concepts (10 visual-domain clusters discovered via CLIP embeddings) and task supervision (5 objective types including captioning, OCR, grounding, detection, and VQA). Using small proxy models (Qwen2-0.5B) paired with a Gaussian-process surrogate and GP-UCB acquisition, MixAtlas searches the resulting mixture space with the same proxy budget as regression-based baselines but finds better-performing mixtures. We evaluate on 10 benchmarks spanning visual understanding, document reasoning, and multimodal reasoning. On Qwen2-7B, optimized mixtures improve average performance by 8.5%-17.6% over the strongest baseline; on Qwen2.5-7B, gains are 1.0%-3.3%. Both settings reach baseline-equivalent training loss in up to 2 times fewer steps. Recipes discovered on 0.5B proxies transfer to 7B-scale training across Qwen model families.
91.6CVMar 10
RubiCap: Rubric-Guided Reinforcement Learning for Dense Image CaptioningTzu-Heng Huang, Sirajul Salekin, Javier Movellan et al.
Dense image captioning is critical for cross-modal alignment in vision-language pretraining and text-to-image generation, but scaling expert-quality annotations is prohibitively expensive. While synthetic captioning via strong vision-language models (VLMs) is a practical alternative, supervised distillation often yields limited output diversity and weak generalization. Reinforcement learning (RL) could overcome these limitations, but its successes have so far been concentrated in verifiable domains that rely on deterministic checkers -- a luxury not available in open-ended captioning. We address this bottleneck with RubiCap, a novel RL framework that derives fine-grained, sample-specific reward signals from LLM-written rubrics. RubiCap first assembles a diverse committee of candidate captions, then employs an LLM rubric writer to extract consensus strengths and diagnose deficiencies in the current policy. These insights are converted into explicit evaluation criteria, enabling an LLM judge to decompose holistic quality assessment and replace coarse scalar rewards with structured, multi-faceted evaluations. Across extensive benchmarks, RubiCap achieves the highest win rates on CapArena, outperforming supervised distillation, prior RL methods, human-expert annotations, and GPT-4V-augmented outputs. On CaptionQA, it demonstrates superior word efficiency: our 7B model matches Qwen2.5-VL-32B-Instruct, and our 3B model surpasses its 7B counterpart. Remarkably, using the compact RubiCap-3B as a captioner produces stronger pretrained VLMs than those trained on captions from proprietary models.
CVJun 23, 2021Code
Probabilistic Attention for Interactive SegmentationPrasad Gabbur, Manjot Bilkhu, Javier Movellan
We provide a probabilistic interpretation of attention and show that the standard dot-product attention in transformers is a special case of Maximum A Posteriori (MAP) inference. The proposed approach suggests the use of Expectation Maximization algorithms for online adaptation of key and value model parameters. This approach is useful for cases in which external agents, e.g., annotators, provide inference-time information about the correct values of some tokens, e.g, the semantic category of some pixels, and we need for this new information to propagate to other tokens in a principled manner. We illustrate the approach on an interactive semantic segmentation task in which annotators and models collaborate online to improve annotation efficiency. Using standard benchmarks, we observe that key adaptation boosts model performance ($\sim10\%$ mIoU) in the low feedback regime and value propagation improves model responsiveness in the high feedback regime. A PyTorch layer implementation of our probabilistic attention model will be made publicly available here: https://github.com/apple/ml-probabilistic-attention.
LGJul 17, 2025
Apple Intelligence Foundation Language Models: Tech Report 2025Ethan Li, Anders Boesen Lindbo Larsen, Chen Zhang et al. · apple-ml, cmu
We introduce two multilingual, multimodal foundation language models that power Apple Intelligence features across Apple devices and services: i a 3B-parameter on-device model optimized for Apple silicon through architectural innovations such as KV-cache sharing and 2-bit quantization-aware training; and ii a scalable server model built on a novel Parallel-Track Mixture-of-Experts PT-MoE transformer that combines track parallelism, mixture-of-experts sparse computation, and interleaved global-local attention to deliver high quality with competitive cost on Apple's Private Cloud Compute platform. Both models are trained on large-scale multilingual and multimodal datasets sourced via responsible web crawling, licensed corpora, and high-quality synthetic data, then further refined with supervised fine-tuning and reinforcement learning on a new asynchronous platform. The resulting models support several additional languages while understanding images and executing tool calls. In public benchmarks and human evaluations, both the server model and the on-device model match or surpass comparably sized open baselines. A new Swift-centric Foundation Models framework exposes guided generation, constrained tool calling, and LoRA adapter fine-tuning, allowing developers to integrate these capabilities with a few lines of code. The latest advancements in Apple Intelligence models are grounded in our Responsible AI approach with safeguards like content filtering and locale-specific evaluation, as well as our commitment to protecting our users' privacy with innovations like Private Cloud Compute.
LGJan 12, 2025
Evaluating Sample Utility for Efficient Data Selection by Mimicking Model WeightsTzu-Heng Huang, Manjot Bilkhu, John Cooper et al.
Multimodal models are trained on large-scale web-crawled datasets, which often contain noise, bias, and irrelevant information. This motivates the use of data selection techniques, which can be divided into model-free variants, relying on heuristic rules and downstream datasets, and model-based approaches, such as those using influence functions. The former can be expensive to design and risks introducing unwanted dataset dependencies, while the latter are often computationally prohibitive. In this work, we propose an efficient, model-based approach using the Mimic Score, a new data-quality metric that leverages the weights of a reference model to assess the usefulness of individual samples for training a new model. Our method relies on measuring alignments between training gradients and a target direction induced by this reference model. Building on the derived mimic scores, we develop Grad-Mimic: a framework that prioritizes samples to learn, estimates overall sample utility, and creates effective filters. Empirically, using mimic scores to guide training improves data efficiency, accelerates convergence, yields consistent performance gains across six image datasets, and enhances CLIP models with 20.7% fewer training steps. Moreover, mimic score-based filters complement existing filtering methods, e.g., training improved CLIP models with 4.7 million fewer samples while offering accurate estimation of dataset quality.
CVJun 6, 2019
Attention is all you need for Videos: Self-attention based Video Summarization using Universal TransformersManjot Bilkhu, Siyang Wang, Tushar Dobhal
Video Captioning and Summarization have become very popular in the recent years due to advancements in Sequence Modelling, with the resurgence of Long-Short Term Memory networks (LSTMs) and introduction of Gated Recurrent Units (GRUs). Existing architectures extract spatio-temporal features using CNNs and utilize either GRUs or LSTMs to model dependencies with soft attention layers. These attention layers do help in attending to the most prominent features and improve upon the recurrent units, however, these models suffer from the inherent drawbacks of the recurrent units themselves. The introduction of the Transformer model has driven the Sequence Modelling field into a new direction. In this project, we implement a Transformer-based model for Video captioning, utilizing 3D CNN architectures like C3D and Two-stream I3D for video extraction. We also apply certain dimensionality reduction techniques so as to keep the overall size of the model within limits. We finally present our results on the MSVD and ActivityNet datasets for Single and Dense video captioning tasks respectively.
CVMar 18, 2019
Human Activity Recognition for Edge DevicesManjot Bilkhu, Hammababdullah Ayyubi
Video activity Recognition has recently gained a lot of momentum with the release of massive Kinetics (400 and 600) data. Architectures such as I3D and C3D networks have shown state-of-the-art performances for activity recognition. The one major pitfall with these state-of-the-art networks is that they require a lot of compute. In this paper we explore how we can achieve comparable results to these state-of-the-art networks for devices-on-edge. We primarily explore two architectures - I3D and Temporal Segment Network. We show that comparable results can be achieved using one tenth the memory usage by changing the testing procedure. We also report our results on Resnet architecture as our backbone apart from the original Inception architecture. Specifically, we achieve 84.54\% top-1 accuracy on UCF-101 dataset using only RGB frames.