CVApr 19, 2022
Performance Evaluation of Action Recognition Models on Low Quality VideosAoi Otani, Ryota Hashiguchi, Kazuki Omi et al.
In the design of action recognition models, the quality of videos is an important issue; however, the trade-off between the quality and performance is often ignored. In general, action recognition models are trained on high-quality videos, hence it is not known how the model performance degrades when tested on low-quality videos, and how much the quality of training videos affects the performance. The issue of video quality is important, however, it has not been studied so far. The goal of this study is to show the trade-off between the performance and the quality of training and test videos by quantitative performance evaluation of several action recognition models for transcoded videos in different qualities. First, we show how the video quality affects the performance of pre-trained models. We transcode the original validation videos of Kinetics400 by changing quality control parameters of JPEG (compression strength) and H.264/AVC (CRF). Then we use the transcoded videos to validate the pre-trained models. Second, we show how the models perform when trained on transcoded videos. We transcode the original training videos of Kinetics400 by changing the quality parameters of JPEG and H.264/AVC. Then we train the models on the transcoded training videos and validate them with the original and transcoded validation videos. Experimental results with JPEG transcoding show that there is no severe performance degradation (up to -1.5%) for compression strength smaller than 70 where no quality degradation is visually observed, and for larger than 80 the performance degrades linearly with respect to the quality index. Experiments with H.264/AVC transcoding show that there is no significant performance loss (up to -1%) with CRF30 while the total size of video files is reduced to 30%.
QMJun 10, 2025
Protriever: End-to-End Differentiable Protein Homology Search for Fitness PredictionRuben Weitzman, Peter Mørch Groth, Lood Van Niekerk et al.
Retrieving homologous protein sequences is essential for a broad range of protein modeling tasks such as fitness prediction, protein design, structure modeling, and protein-protein interactions. Traditional workflows have relied on a two-step process: first retrieving homologs via Multiple Sequence Alignments (MSA), then training models on one or more of these alignments. However, MSA-based retrieval is computationally expensive, struggles with highly divergent sequences or complex insertions & deletions patterns, and operates independently of the downstream modeling objective. We introduce Protriever, an end-to-end differentiable framework that learns to retrieve relevant homologs while simultaneously training for the target task. When applied to protein fitness prediction, Protriever achieves state-of-the-art performance compared to sequence-based models that rely on MSA-based homolog retrieval, while being two orders of magnitude faster through efficient vector search. Protriever is both architecture- and task-agnostic, and can flexibly adapt to different retrieval strategies and protein databases at inference time -- offering a scalable alternative to alignment-centric approaches.
LGSep 4, 2025
Mitigating Catastrophic Forgetting and Mode Collapse in Text-to-Image Diffusion via Latent ReplayAoi Otani
Continual learning -- the ability to acquire knowledge incrementally without forgetting previous skills -- is fundamental to natural intelligence. While the human brain excels at this, artificial neural networks struggle with "catastrophic forgetting," where learning new tasks erases previously acquired knowledge. This challenge is particularly severe for text-to-image diffusion models, which generate images from textual prompts. Additionally, these models face "mode collapse," where their outputs become increasingly repetitive over time. To address these challenges, we apply Latent Replay, a neuroscience-inspired approach, to diffusion models. Traditional replay methods mitigate forgetting by storing and revisiting past examples, typically requiring large collections of images. Latent Replay instead retains only compact, high-level feature representations extracted from the model's internal architecture. This mirrors the hippocampal process of storing neural activity patterns rather than raw sensory inputs, reducing memory usage while preserving critical information. Through experiments with five sequentially learned visual concepts, we demonstrate that Latent Replay significantly outperforms existing methods in maintaining model versatility. After learning all concepts, our approach retained 77.59% Image Alignment (IA) on the earliest concept, 14% higher than baseline methods, while maintaining diverse outputs. Surprisingly, random selection of stored latent examples outperforms similarity-based strategies. Our findings suggest that Latent Replay enables efficient continual learning for generative AI models, paving the way for personalized text-to-image models that evolve with user needs without excessive computational costs.