AIMay 26Code
Laguna M.1/XS.2 Technical ReportJulien Abadji, Marah Abdin, Connor Adams et al.
We present Laguna M.1 and Laguna XS.2, two Mixture-of-Experts foundation models built for long-horizon, agentic coding: M.1 has $225.8$B total parameters ($23.4$B activated per token) and XS.2 has $33.4$B total ($3$B activated). Both models were trained from scratch end-to-end inside the same internal system that we refer to as our Model Factory: a tightly-integrated stack of versioned data, training, evaluation, and inference components that turn model development into an industrial process. We describe the principles and design choices of the Model Factory and also detail the end-to-end training process of our models, throughout pre-training data and architecture, post-training stages, evaluation, and quantization. On agentic software engineering and terminal benchmarks (SWE-bench Verified, SWE-bench Multilingual, SWE-Bench Pro, and Terminal-Bench 2.0) M.1 and XS.2 are competitive with state-of-the-art open models in their respective weight classes. Laguna XS.2 weights are released under Apache~2.0 at https://huggingface.co/collections/poolside/laguna-xs2.
CVMar 4
InfinityStory: Unlimited Video Generation with World Consistency and Character-Aware Shot TransitionsMohamed Elmoghany, Liangbing Zhao, Xiaoqian Shen et al. · allen-ai
Generating long-form storytelling videos with consistent visual narratives remains a significant challenge in video synthesis. We present a novel framework, dataset, and a model that address three critical limitations: background consistency across shots, seamless multi-subject shot-to-shot transitions, and scalability to hour-long narratives. Our approach introduces a background-consistent generation pipeline that maintains visual coherence across scenes while preserving character identity and spatial relationships. We further propose a transition-aware video synthesis module that generates smooth shot transitions for complex scenarios involving multiple subjects entering or exiting frames, going beyond the single-subject limitations of prior work. To support this, we contribute with a synthetic dataset of 10,000 multi-subject transition sequences covering underrepresented dynamic scene compositions. On VBench, InfinityStory achieves the highest Background Consistency (88.94), highest Subject Consistency (82.11), and the best overall average rank (2.80), showing improved stability, smoother transitions, and better temporal coherence.
SDOct 21, 2022
Optimizing Bilingual Neural Transducer with Synthetic Code-switching Text GenerationThien Nguyen, Nathalie Tran, Liuhui Deng et al.
Code-switching describes the practice of using more than one language in the same sentence. In this study, we investigate how to optimize a neural transducer based bilingual automatic speech recognition (ASR) model for code-switching speech. Focusing on the scenario where the ASR model is trained without supervised code-switching data, we found that semi-supervised training and synthetic code-switched data can improve the bilingual ASR system on code-switching speech. We analyze how each of the neural transducer's encoders contributes towards code-switching performance by measuring encoder-specific recall values, and evaluate our English/Mandarin system on the ASCEND data set. Our final system achieves 25% mixed error rate (MER) on the ASCEND English/Mandarin code-switching test set -- reducing the MER by 2.1% absolute compared to the previous literature -- while maintaining good accuracy on the monolingual test sets.
DCMar 11
Multi-GPU Quantum Circuit Simulation and the Impact of Network PerformanceW. Michael Brown, Anurag Ramesh, Thomas Lubinski et al.
As is intrinsic to the fundamental goal of quantum computing, classical simulation of quantum algorithms is notoriously demanding in resource requirements. Nonetheless, simulation is critical to the success of the field and a requirement for algorithm development and validation, as well as hardware design. GPU-acceleration has become standard practice for simulation, and due to the exponential scaling inherent in classical methods, multi-GPU simulation can be required to achieve representative system sizes. In this case, inter-GPU communications can bottleneck performance. In this work, we present the introduction of MPI into the QED-C Application-Oriented Benchmarks to facilitate benchmarking on HPC systems. We review the advances in interconnect technology and the APIs for multi-GPU communication. We benchmark using a variety of interconnect paths, including the recent NVIDIA Grace Blackwell NVL72 architecture that represents the first product to expand high-bandwidth GPU-specialized interconnects across multiple nodes. We show that while improvements to GPU architecture have led to speedups of over 4.5X across the last few generations of GPUs, advances in interconnect performance have had a larger impact with over 16X performance improvements in time to solution for multi-GPU simulations.
CVSep 17, 2025Code
Semi-MoE: Mixture-of-Experts meets Semi-Supervised Histopathology SegmentationNguyen Lan Vi Vu, Thanh-Huy Nguyen, Thien Nguyen et al.
Semi-supervised learning has been employed to alleviate the need for extensive labeled data for histopathology image segmentation, but existing methods struggle with noisy pseudo-labels due to ambiguous gland boundaries and morphological misclassification. This paper introduces Semi-MOE, to the best of our knowledge, the first multi-task Mixture-of-Experts framework for semi-supervised histopathology image segmentation. Our approach leverages three specialized expert networks: A main segmentation expert, a signed distance field regression expert, and a boundary prediction expert, each dedicated to capturing distinct morphological features. Subsequently, the Multi-Gating Pseudo-labeling module dynamically aggregates expert features, enabling a robust fuse-and-refine pseudo-labeling mechanism. Furthermore, to eliminate manual tuning while dynamically balancing multiple learning objectives, we propose an Adaptive Multi-Objective Loss. Extensive experiments on GlaS and CRAG benchmarks show that our method outperforms state-of-the-art approaches in low-label settings, highlighting the potential of MoE-based architectures in advancing semi-supervised segmentation. Our code is available at https://github.com/vnlvi2k3/Semi-MoE.
CVApr 13
Seeing Through the Tool: A Controlled Benchmark for Occlusion Robustness in Foundation Segmentation ModelsNhan Ho, Luu Le, Thanh-Huy Nguyen et al.
Occlusion, where target structures are partially hidden by surgical instruments or overlapping tissues, remains a critical yet underexplored challenge for foundation segmentation models in clinical endoscopy. We introduce OccSAM-Bench, a benchmark designed to systematically evaluate SAM-family models under controlled, synthesized surgical occlusion. Our framework simulates two occlusion types (i.e., surgical tool overlay and cutout) across three calibrated severity levels on three public polyp datasets. We propose a novel three-region evaluation protocol that decomposes segmentation performance into full, visible-only, and invisible targets. This metric exposes behaviors that standard amodal evaluation obscures, revealing two distinct model archetypes: Occluder-Aware models (SAM, SAM 2, SAM 3, MedSAM3), which prioritize visible tissue delineation and reject instruments, and Occluder-Agnostic models (MedSAM, MedSAM2), which confidently predict into occluded regions. SAM-Med2D aligns with neither and underperforms across all conditions. Ultimately, our results demonstrate that occlusion robustness is not uniform across architectures, and model selection must be driven by specific clinical intent-whether prioritizing conservative visible-tissue segmentation or the amodal inference of hidden anatomy.
AIApr 2
ByteRover: Agent-Native Memory Through LLM-Curated Hierarchical ContextAndy Nguyen, Danh Doan, Hoang Pham et al.
Memory-Augmented Generation (MAG) extends large language models with external memory to support long-context reasoning, but existing approaches universally treat memory as an external service that agents call into, delegating storage to separate pipelines of chunking, embedding, and graph extraction. This architectural separation means the system that stores knowledge does not understand it, leading to semantic drift between what the agent intended to remember and what the pipeline actually captured, loss of coordination context across agents, and fragile recovery after failures. In this paper, we propose ByteRover, an agent-native memory architecture that inverts the memory pipeline: the same LLM that reasons about a task also curates, structures, and retrieves knowledge. ByteRover represents knowledge in a hierarchical Context Tree, a file-based knowledge graph organized as Domain, Topic, Subtopic, and Entry, where each entry carries explicit relations, provenance, and an Adaptive Knowledge Lifecycle (AKL) with importance scoring, maturity tiers, and recency decay. Retrieval uses a 5-tier progressive strategy that resolves most queries at sub-100 ms latency without LLM calls, escalating to agentic reasoning only for novel questions. Experiments on LoCoMo and LongMemEval demonstrate that ByteRover achieves state-of-the-art accuracy on LoCoMo and competitive results on LongMemEval while requiring zero external infrastructure, no vector database, no graph database, no embedding service, with all knowledge stored as human-readable markdown files on the local filesystem.
CVDec 10, 2025
Modality-Specific Enhancement and Complementary Fusion for Semi-Supervised Multi-Modal Brain Tumor SegmentationTien-Dat Chung, Ba-Thinh Lam, Thanh-Huy Nguyen et al.
Semi-supervised learning (SSL) has become a promising direction for medical image segmentation, enabling models to learn from limited labeled data alongside abundant unlabeled samples. However, existing SSL approaches for multi-modal medical imaging often struggle to exploit the complementary information between modalities due to semantic discrepancies and misalignment across MRI sequences. To address this, we propose a novel semi-supervised multi-modal framework that explicitly enhances modality-specific representations and facilitates adaptive cross-modal information fusion. Specifically, we introduce a Modality-specific Enhancing Module (MEM) to strengthen semantic cues unique to each modality via channel-wise attention, and a learnable Complementary Information Fusion (CIF) module to adaptively exchange complementary knowledge between modalities. The overall framework is optimized using a hybrid objective combining supervised segmentation loss and cross-modal consistency regularization on unlabeled data. Extensive experiments on the BraTS 2019 (HGG subset) demonstrate that our method consistently outperforms strong semi-supervised and multi-modal baselines under 1\%, 5\%, and 10\% labeled data settings, achieving significant improvements in both Dice and Sensitivity scores. Ablation studies further confirm the complementary effects of our proposed MEM and CIF in bridging cross-modality discrepancies and improving segmentation robustness under scarce supervision.
SEOct 16, 2024
Mastering the Craft of Data Synthesis for CodeLLMsMeng Chen, Philip Arthur, Qianyu Feng et al.
Large language models (LLMs) have shown impressive performance in \emph{code} understanding and generation, making coding tasks a key focus for researchers due to their practical applications and value as a testbed for LLM evaluation. Data synthesis and filtering techniques have been widely adopted and shown to be highly effective in this context. In this paper, we present a focused survey and taxonomy of these techniques, emphasizing recent advancements. We highlight key challenges, explore future research directions, and offer practical guidance for new researchers entering the field.
CVJul 9, 2025
A Survey on Long-Video Storytelling Generation: Architectures, Consistency, and Cinematic QualityMohamed Elmoghany, Ryan Rossi, Seunghyun Yoon et al.
Despite the significant progress that has been made in video generative models, existing state-of-the-art methods can only produce videos lasting 5-16 seconds, often labeled "long-form videos". Furthermore, videos exceeding 16 seconds struggle to maintain consistent character appearances and scene layouts throughout the narrative. In particular, multi-subject long videos still fail to preserve character consistency and motion coherence. While some methods can generate videos up to 150 seconds long, they often suffer from frame redundancy and low temporal diversity. Recent work has attempted to produce long-form videos featuring multiple characters, narrative coherence, and high-fidelity detail. We comprehensively studied 32 papers on video generation to identify key architectural components and training strategies that consistently yield these qualities. We also construct a comprehensive novel taxonomy of existing methods and present comparative tables that categorize papers by their architectural designs and performance characteristics.
LGMay 5, 2025
End-to-end fully-binarized network design: from Generic Learned Thermometer to Block PruningThien Nguyen, William Guicquero
Existing works on Binary Neural Network (BNN) mainly focus on model's weights and activations while discarding considerations on the input raw data. This article introduces Generic Learned Thermometer (GLT), an encoding technique to improve input data representation for BNN, relying on learning non linear quantization thresholds. This technique consists in multiple data binarizations which can advantageously replace a conventional Analog to Digital Conversion (ADC) that uses natural binary coding. Additionally, we jointly propose a compact topology with light-weight grouped convolutions being trained thanks to block pruning and Knowledge Distillation (KD), aiming at reducing furthermore the model size so as its computational complexity. We show that GLT brings versatility to the BNN by intrinsically performing global tone mapping, enabling significant accuracy gains in practice (demonstrated by simulations on the STL-10 and VWW datasets). Moreover, when combining GLT with our proposed block-pruning technique, we successfully achieve lightweight (under 1Mb), fully-binarized models with limited accuracy degradation while being suitable for in-sensor always-on inference use cases.
LGFeb 23, 2025
Few-shot Continual Relation Extraction via Open Information ExtractionThiem Nguyen, Anh Nguyen, Quyen Tran et al.
Typically, Few-shot Continual Relation Extraction (FCRE) models must balance retaining prior knowledge while adapting to new tasks with extremely limited data. However, real-world scenarios may also involve unseen or undetermined relations that existing methods still struggle to handle. To address these challenges, we propose a novel approach that leverages the Open Information Extraction concept of Knowledge Graph Construction (KGC). Our method not only exposes models to all possible pairs of relations, including determined and undetermined labels not available in the training set, but also enriches model knowledge with diverse relation descriptions, thereby enhancing knowledge retention and adaptability in this challenging scenario. In the perspective of KGC, this is the first work explored in the setting of Continual Learning, allowing efficient expansion of the graph as the data evolves. Experimental results demonstrate our superior performance compared to other state-of-the-art FCRE baselines, as well as the efficiency in handling dynamic graph construction in this setting.
LGJun 19, 2024
Machine Learning Applications of Quantum Computing: A ReviewThien Nguyen, Tuomo Sipola, Jari Hautamäki
At the intersection of quantum computing and machine learning, this review paper explores the transformative impact these technologies are having on the capabilities of data processing and analysis, far surpassing the bounds of traditional computational methods. Drawing upon an in-depth analysis of 32 seminal papers, this review delves into the interplay between quantum computing and machine learning, focusing on transcending the limitations of classical computing in advanced data processing and applications. This review emphasizes the potential of quantum-enhanced methods in enhancing cybersecurity, a critical sector that stands to benefit significantly from these advancements. The literature review, primarily leveraging Science Direct as an academic database, delves into the transformative effects of quantum technologies on machine learning, drawing insights from a diverse collection of studies and scholarly articles. While the focus is primarily on the growing significance of quantum computing in cybersecurity, the review also acknowledges the promising implications for other sectors as the field matures. Our systematic approach categorizes sources based on quantum machine learning algorithms, applications, challenges, and potential future developments, uncovering that quantum computing is increasingly being implemented in practical machine learning scenarios. The review highlights advancements in quantum-enhanced machine learning algorithms and their potential applications in sectors such as cybersecurity, emphasizing the need for industry-specific solutions while considering ethical and security concerns. By presenting an overview of the current state and projecting future directions, the paper sets a foundation for ongoing research and strategic advancement in quantum machine learning.