51.5CLMay 31Code
MiCU: End-to-End Smart Home Command Understanding with Large Language ModelHaowei Han, Kexin Hu, Weiwei Cai et al.
Command understanding systems in smart home ecosystems can automate device control and substantially improve user experience. However, while they perform well on precise utterances (e.g., "turn on the bedroom light"), they struggle with ambiguous or misaligned commands (e.g., "make the bedroom cozy"). Large language models (LLMs) generalize well across various domains and can outperform traditional rule-based systems on such tasks, but their effectiveness is often constrained by scarce domain-specific data, insufficient task-specific adaptation, and high computational costs. In this paper, we propose an automated training data synthesis workflow using user logs and LLMs; then we build MiCU, a domain-specific LLM that excels at command understanding. Specifically, we employ curriculum learning to inject domain knowledge into the base LLM, then we enhance its reasoning ability via cold-start training combined with reinforcement learning (RL) guided by domain-specific thinking rules. Additionally, we introduce a token compression technique that condenses device description into a single special token, substantially reducing inference overhead and enabling \model-fast, an efficient variant optimized for long inputs. Extensive experiments show that MiCU significantly outperforms baselines, with an average accuracy gain of 20.01% across all device categories. We have deployed MiCU in the Xiaomi Home app, receiving approximately 1.7 million page views per day. Production evaluations show that MiCU reduces user correction rate by 1.57% and increases human audited accuracy by 32.05%. Our data and code are available at https://github.com/xiaomi-research/iot_spec_llm
CVAug 26, 2024
DynaSurfGS: Dynamic Surface Reconstruction with Planar-based Gaussian SplattingWeiwei Cai, Weicai Ye, Peng Ye et al.
Dynamic scene reconstruction has garnered significant attention in recent years due to its capabilities in high-quality and real-time rendering. Among various methodologies, constructing a 4D spatial-temporal representation, such as 4D-GS, has gained popularity for its high-quality rendered images. However, these methods often produce suboptimal surfaces, as the discrete 3D Gaussian point clouds fail to align with the object's surface precisely. To address this problem, we propose DynaSurfGS to achieve both photorealistic rendering and high-fidelity surface reconstruction of dynamic scenarios. Specifically, the DynaSurfGS framework first incorporates Gaussian features from 4D neural voxels with the planar-based Gaussian Splatting to facilitate precise surface reconstruction. It leverages normal regularization to enforce the smoothness of the surface of dynamic objects. It also incorporates the as-rigid-as-possible (ARAP) constraint to maintain the approximate rigidity of local neighborhoods of 3D Gaussians between timesteps and ensure that adjacent 3D Gaussians remain closely aligned throughout. Extensive experiments demonstrate that DynaSurfGS surpasses state-of-the-art methods in both high-fidelity surface reconstruction and photorealistic rendering.
CVMay 12, 2025Code
Step1X-3D: Towards High-Fidelity and Controllable Generation of Textured 3D AssetsWeiyu Li, Xuanyang Zhang, Zheng Sun et al.
While generative artificial intelligence has advanced significantly across text, image, audio, and video domains, 3D generation remains comparatively underdeveloped due to fundamental challenges such as data scarcity, algorithmic limitations, and ecosystem fragmentation. To this end, we present Step1X-3D, an open framework addressing these challenges through: (1) a rigorous data curation pipeline processing >5M assets to create a 2M high-quality dataset with standardized geometric and textural properties; (2) a two-stage 3D-native architecture combining a hybrid VAE-DiT geometry generator with an diffusion-based texture synthesis module; and (3) the full open-source release of models, training code, and adaptation modules. For geometry generation, the hybrid VAE-DiT component produces TSDF representations by employing perceiver-based latent encoding with sharp edge sampling for detail preservation. The diffusion-based texture synthesis module then ensures cross-view consistency through geometric conditioning and latent-space synchronization. Benchmark results demonstrate state-of-the-art performance that exceeds existing open-source methods, while also achieving competitive quality with proprietary solutions. Notably, the framework uniquely bridges the 2D and 3D generation paradigms by supporting direct transfer of 2D control techniques~(e.g., LoRA) to 3D synthesis. By simultaneously advancing data quality, algorithmic fidelity, and reproducibility, Step1X-3D aims to establish new standards for open research in controllable 3D asset generation.
CVMay 30, 2025Code
ViStoryBench: Comprehensive Benchmark Suite for Story VisualizationCailin Zhuang, Ailin Huang, Wei Cheng et al.
Story visualization aims to generate coherent image sequences that faithfully depict a narrative and align with character references. Despite progress in generative models, existing benchmarks are narrow in scope, often limited to short prompts, no character reference, or single-image cases, and fall short of real-world storytelling complexity. This hinders a nuanced understanding of model capabilities and limitations. We present ViStoryBench, a comprehensive benchmark designed to evaluate story visualization models across diverse narrative structures, visual styles, and character settings. The benchmark features richly annotated multi-shot scripts derived from curated stories spanning literature, film, and folklore. Large language models assist in story summarization and script generation, with all outputs verified by humans to ensure coherence and fidelity. Character references are carefully curated to maintain intra-story consistency across varying artistic styles. To enable thorough evaluation, ViStoryBench introduces a set of automated metrics that assess character consistency, style similarity, prompt adherence, aesthetic quality, and generation artifacts such as copy-paste behavior. These metrics are validated through human studies, and used to benchmark a broad range of open-source and commercial models. ViStoryBench offers a high-fidelity, multi-dimensional evaluation suite that facilitates systematic analysis and fosters future progress in visual storytelling.
FLU-DYNSep 23, 2024
Neural refractive index field: Unlocking the Potential of Background-oriented Schlieren Tomography in Volumetric Flow VisualizationYuanzhe He, Yutao Zheng, Shijie Xu et al.
Background-oriented Schlieren tomography (BOST) is a prevalent method for visualizing intricate turbulent flows, valued for its ease of implementation and capacity to capture three-dimensional distributions of a multitude of flow parameters. However, the voxel-based meshing scheme leads to significant challenges, such as inadequate spatial resolution, substantial discretization errors, poor noise immunity, and excessive computational costs. This work presents an innovative reconstruction approach termed neural refractive index field (NeRIF) which implicitly represents the flow field with a neural network, which is trained with tailored strategies. Both numerical simulations and experimental demonstrations on turbulent Bunsen flames suggest that our approach can significantly improve the reconstruction accuracy and spatial resolution while concurrently reducing computational expenses. Although showcased in the context of background-oriented schlieren tomography here, the key idea embedded in the NeRIF can be readily adapted to various other tomographic modalities including tomographic absorption spectroscopy and tomographic particle imaging velocimetry, broadening its potential impact across different domains of flow visualization and analysis.
CVDec 4, 2019Code
PiiGAN: Generative Adversarial Networks for Pluralistic Image InpaintingWeiwei Cai, Zhanguo Wei
The latest methods based on deep learning have achieved amazing results regarding the complex work of inpainting large missing areas in an image. But this type of method generally attempts to generate one single "optimal" result, ignoring many other plausible results. Considering the uncertainty of the inpainting task, one sole result can hardly be regarded as a desired regeneration of the missing area. In view of this weakness, which is related to the design of the previous algorithms, we propose a novel deep generative model equipped with a brand new style extractor which can extract the style feature (latent vector) from the ground truth. Once obtained, the extracted style feature and the ground truth are both input into the generator. We also craft a consistency loss that guides the generated image to approximate the ground truth. After iterations, our generator is able to learn the mapping of styles corresponding to multiple sets of vectors. The proposed model can generate a large number of results consistent with the context semantics of the image. Moreover, we evaluated the effectiveness of our model on three datasets, i.e., CelebA, PlantVillage, and MauFlex. Compared to state-of-the-art inpainting methods, this model is able to offer desirable inpainting results with both better quality and higher diversity. The code and model will be made available on https://github.com/vivitsai/PiiGAN.
DCMay 2, 2019Code
Real Differences between OT and CRDT in Building Co-Editing Systems and Real World ApplicationsDavid Sun, Chengzheng Sun, Agustina Ng et al.
OT (Operational Transformation) was invented for supporting real-time co-editors in the late 1980s and has evolved to become a core technique used in today's working co-editors and adopted in major industrial products. CRDT (Commutative Replicated Data Type) for co-editors was first proposed around 2006, under the name of WOOT (WithOut Operational Transformation). Follow-up CRDT variations are commonly labeled as "post-OT" techniques and have made broad claims of superiority over OT solutions, in terms of correctness, time and space complexity, simplicity, etc. Over one decade later, however, OT remains the choice for building the vast majority of co-editors, whereas CRDT is rarely found in working co-editors. Why? To seek truth from facts, we set out to conduct a comprehensive and critical review of representative OT and CRDT solutions and working co-editors based on them. From this work, we have made important discoveries about OT and CRDT, and revealed facts and evidences that refute CRDT claims over OT on all accounts. We present our discoveries in three related and complementary articles. In prior two articles, we have revealed the similarities of OT and CRDT in following the same general transformation approach in co-editors, and their real differences in correctness and complexity. In this article, we examine the role of building working co-editors in shaping OT and CRDT research and solutions, and consequential differences in the choice between OT and CRDT in real world co-editors and industry products. In particular, we review the evolution of co-editors from research vehicles to real world applications, and discuss representative OT-based co-editors and alternative approaches in industry products and open source projects. Moreover, we evaluate CRDT-based co-editors in relation to published CRDT solutions, and clarify some myths surrounding "peer-to-peer" co-editing.
CVNov 21, 2025
Native 3D Editing with Full AttentionWeiwei Cai, Shuangkang Fang, Weicai Ye et al.
Instruction-guided 3D editing is a rapidly emerging field with the potential to broaden access to 3D content creation. However, existing methods face critical limitations: optimization-based approaches are prohibitively slow, while feed-forward approaches relying on multi-view 2D editing often suffer from inconsistent geometry and degraded visual quality. To address these issues, we propose a novel native 3D editing framework that directly manipulates 3D representations in a single, efficient feed-forward pass. Specifically, we create a large-scale, multi-modal dataset for instruction-guided 3D editing, covering diverse addition, deletion, and modification tasks. This dataset is meticulously curated to ensure that edited objects faithfully adhere to the instructional changes while preserving the consistency of unedited regions with the source object. Building upon this dataset, we explore two distinct conditioning strategies for our model: a conventional cross-attention mechanism and a novel 3D token concatenation approach. Our results demonstrate that token concatenation is more parameter-efficient and achieves superior performance. Extensive evaluations show that our method outperforms existing 2D-lifting approaches, setting a new benchmark in generation quality, 3D consistency, and instruction fidelity.
LGSep 1, 2025
Prediction, Generation of WWTPs microbiome community structures and Clustering of WWTPs various feature attributes using DE-BP model, SiTime-GAN model and DPNG-EPMC ensemble clustering algorithm with modulation of microbial ecosystem healthMingzhi Dai, Weiwei Cai, Xiang Feng et al.
Microbiomes not only underpin Earth's biogeochemical cycles but also play crucial roles in both engineered and natural ecosystems, such as the soil, wastewater treatment, and the human gut. However, microbiome engineering faces significant obstacles to surmount to deliver the desired improvements in microbiome control. Here, we use the backpropagation neural network (BPNN), optimized through differential evolution (DE-BP), to predict the microbial composition of activated sludge (AS) systems collected from wastewater treatment plants (WWTPs) located worldwide. Furthermore, we introduce a novel clustering algorithm termed Directional Position Nonlinear Emotional Preference Migration Behavior Clustering (DPNG-EPMC). This method is applied to conduct a clustering analysis of WWTPs across various feature attributes. Finally, we employ the Similar Time Generative Adversarial Networks (SiTime-GAN), to synthesize novel microbial compositions and feature attributes data. As a result, we demonstrate that the DE-BP model can provide superior predictions of the microbial composition. Additionally, we show that the DPNG-EPMC can be applied to the analysis of WWTPs under various feature attributes. Finally, we demonstrate that the SiTime-GAN model can generate valuable incremental synthetic data. Our results, obtained through predicting the microbial community and conducting analysis of WWTPs under various feature attributes, develop an understanding of the factors influencing AS communities.
CVApr 27, 2020
GraftNet: An Engineering Implementation of CNN for Fine-grained Multi-label TaskChunhua Jia, Lei Zhang, Hui Huang et al.
Multi-label networks with branches are proved to perform well in both accuracy and speed, but lacks flexibility in providing dynamic extension onto new labels due to the low efficiency of re-work on annotating and training. For multi-label classification task, to cover new labels we need to annotate not only newly collected images, but also the previous whole dataset to check presence of these new labels. Also training on whole re-annotated dataset costs much time. In order to recognize new labels more effectively and accurately, we propose GraftNet, which is a customizable tree-like network with its trunk pretrained with a dynamic graph for generic feature extraction, and branches separately trained on sub-datasets with single label to improve accuracy. GraftNet could reduce cost, increase flexibility, and incrementally handle new labels. Experimental results show that it has good performance on our human attributes recognition task, which is fine-grained multi-label classification.
DCMay 2, 2019
Real Differences between OT and CRDT in Correctness and Complexity for Consistency Maintenance in Co-EditorsDavid Sun, Chengzheng Sun, Agustina Ng et al.
OT (Operational Transformation) was invented for supporting real-time co-editors in the late 1980s and has evolved to become core techniques widely used in today's working co-editors and adopted in industrial products. CRDT (Commutative Replicated Data Type) for co-editors was first proposed around 2006, under the name of WOOT (WithOut Operational Transformation). Follow-up CRDT variations are commonly labeled as "post-OT" techniques capable of making concurrent operations natively commutative in co-editors. On top of that, CRDT solutions have made broad claims of superiority over OT solutions, and often portrayed OT as an incorrect and inefficient technique. Over one decade later, however, CRDT is rarely found in working co-editors; OT remains the choice for building the vast majority of today's co-editors. Contradictions between the reality and CRDT's purported advantages have been the source of much confusion and debate among co-editing researcher sand developers. To seek truth from facts, we set out to conduct a comprehensive and critical review on representative OT and CRDT solutions and co-editors based on them. From this work, we have made important discoveries about OT and CRDT, and revealed facts and evidences that refute CRDT claims over OT on all accounts. These discoveries help explain the underlying reasons for the choice between OT and CRDT in the real world. We report these results in a series of three articles. In the second article of this series, we reveal the differences between OT and CRDT in their basic approaches to realizing the same general transformation and how such differences had resulted in different challenges and consequential correctness and complexity issues. Moreover, we reveal hidden complexity and algorithmic flaws with representative CRDT solutions, and discuss common myths and facts related to correctness and complexity of OT and CRDT.
DCMay 2, 2019
Real Differences between OT and CRDT under a General Transformation Framework for Consistency Maintenance in Co-EditorsChengzheng Sun, David Sun, Agustina et al.
OT (Operational Transformation) was invented for supporting real-time co-editors in the late 1980s and has evolved to become a core technique used in today's working co-editors and adopted in major industrial products. CRDT (Commutative Replicated Data Type) for co-editors was first proposed around 2006, under the name of WOOT (WithOut Operational Transformation). Follow-up CRDT variations are commonly labeled as "post-OT" techniques capable of making concurrent operations natively commutative in co-editors. On top of that, CRDT solutions have made broad claims of superiority over OT solutions, and routinely portrayed OT as an incorrect, complex and inefficient technique. Over one decade later, however, OT remains the choice for building the vast majority of co-editors, whereas CRDT is rarely found in working co-editors. Contradictions between the reality and CRDT's purported advantages have been the source of much confusion and debate in co-editing communities. Have the vast majority of co-editors been unfortunate in choosing the faulty and inferior OT, or those CRDT claims are false? What are the real differences between OT and CRDT for co-editors? What are the key factors and underlying reasons behind the choices between OT and CRDT in the real world? To seek truth from facts, we set out to conduct a comprehensive and critical review on representative OT and CRDT solutions and working co-editors based on them. From this work, we have made important discoveries about OT and CRDT, and revealed facts and evidences that refute CRDT claims over OT on all accounts. We report our discoveries in a series of three articles and the current article is the first one in this series. We hope the discoveries from this work help clear up common misconceptions and confusions surrounding OT and CRDT, and accelerate progress in co-editing technology for real world applications.
DCOct 4, 2018
Real Differences between OT and CRDT for Co-EditorsChengzheng Sun, David Sun, Agustina et al.
OT (Operational Transformation) was invented for supporting real-time co-editors in the late 1980s and has evolved to become a core technique used in today's working co-editors and adopted in major industrial products. CRDT (Commutative Replicated Data Type) for co-editors was first proposed around 2006, under the name of WOOT (WithOut Operational Transformation). Follow-up CRDT variations are commonly labeled as "post-OT" techniques capable of making concurrent operations natively commutative, and have made broad claims of superiority over OT solutions, in terms of correctness, time and space complexity, simplicity, etc. Over one decade later, however, CRDT solutions are rarely found in working co-editors, while OT solutions remain the choice for building the vast majority of co-editors. Contradictions between this reality and CRDT's purported advantages have been the source of much debate and confusion in co-editing research and developer communities. What is CRDT really to co-editing? What are the real differences between OT and CRDT for co-editors? What are the key factors that may have affected the adoption of and choice between OT and CRDT for co-editors in the real world? In this paper, we report our discoveries, in relation to these questions and beyond, from a comprehensive review and comparison study on representative OT and CRDT solutions and working co-editors based on them. Moreover, this work reveals facts and presents evidences that refute CRDT claimed advantages over OT. We hope the results reported in this paper will help clear up common myths, misconceptions, and confusions surrounding alternative co-editing techniques, and accelerate progress in co-editing technology for real world applications.