NA · Nov 5, 2017
On the Stability and Accuracy of Partially and Fully Implicit Schemes for Phase Field Modeling
Jinchao Xu, Yukun Li, Shuonan Wu et al.
We study in this paper the accuracy and stability of partially and fully implicit schemes for phase field modeling. Through theoretical and numerical analysis of Allen-Cahn and Cahn-Hilliard models, we investigate the potential problems of using partially implicit schemes, demonstrate the importance of using fully implicit schemes, and discuss the limitations of energy stability, which is often used to evaluate the quality of a numerical scheme for phase-field modeling. In particular, we make the following observations: 1. a convex splitting scheme (CSS in short) can be equivalent to some fully implicit scheme (FIS in short) with a much different time scaling and thus it may lack numerical accuracy; 2. most implicit schemes (in discussions) are energy-stable if the time-step size is sufficiently small; 3. a traditionally known conditionally energy-stable scheme still possesses an unconditionally energy-stable physical solution; 4. an unconditionally energy-stable scheme is not necessarily better than a conditionally energy-stable scheme when the time step size is not small enough; 5. a first-order FIS for the Allen-Cahn model can be devised so that the maximum principle will be valid on the discrete level and hence the discrete phase variable satisfies $|u_h(x)|\le 1$ for all $x$ and, furthermore, the linearized discretized system can be effectively preconditioned by discrete Poisson operators.
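As a rough illustration of observations 1 and 2, the sketch below compares one convex splitting step with one fully implicit (backward Euler) step. It is a toy under strong assumptions that are not from the paper: the Laplacian is dropped, so the Allen-Cahn model reduces to the scalar ODE $u' = -(u^3 - u)/\varepsilon^2$ with energy $F(u) = (u^2-1)^2/4$, and the names `css_step`, `fis_step`, and the parameter `a = dt / eps**2` are hypothetical.

```python
from typing import Callable, List


def energy(u: float) -> float:
    """Double-well energy F(u) = (u^2 - 1)^2 / 4 of the toy model."""
    return 0.25 * (u * u - 1.0) ** 2


def solve_increasing(g: Callable[[float], float],
                     lo: float = -2.0, hi: float = 2.0,
                     iters: int = 200) -> float:
    """Bisection for the unique root of an increasing function g on [lo, hi]."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if g(mid) > 0.0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)


def css_step(u_old: float, a: float) -> float:
    """Convex splitting: (u - u_old)/dt = -(u^3 - u_old)/eps^2, a = dt/eps^2.

    The residual u + a*u^3 - (1 + a)*u_old is increasing in u for every a > 0.
    """
    return solve_increasing(lambda u: u + a * u ** 3 - (1.0 + a) * u_old)


def fis_step(u_old: float, a: float) -> float:
    """Fully implicit: (u - u_old)/dt = -(u^3 - u)/eps^2, a = dt/eps^2.

    The residual (1 - a)*u + a*u^3 - u_old is increasing only when a < 1,
    so this toy solver is restricted to that range.
    """
    if not 0.0 < a < 1.0:
        raise ValueError("this sketch needs 0 < a < 1 for a unique root")
    return solve_increasing(lambda u: (1.0 - a) * u + a * u ** 3 - u_old)


def trajectory(step: Callable[[float, float], float],
               u0: float, a: float, n_steps: int) -> List[float]:
    """Apply `step` repeatedly and return [u0, u1, ..., u_n]."""
    us = [u0]
    for _ in range(n_steps):
        us.append(step(us[-1], a))
    return us


if __name__ == "__main__":
    a = 0.5
    css = trajectory(css_step, 0.1, a, 20)
    fis = trajectory(fis_step, 0.1, a, 20)
    for n in (1, 5, 20):
        print(n, round(css[n], 4), round(fis[n], 4))
```

In this toy setting both schemes decrease the energy and keep $|u| \le 1$, but the convex splitting iterate moves toward the equilibrium $u = 1$ more slowly than the fully implicit one for the same step size, which is the kind of time-scaling discrepancy the abstract refers to.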
CV · Jan 28 · Code
DeepSeek-OCR 2: Visual Causal Flow
Haoran Wei, Yaofeng Sun, Yukun Li
We present DeepSeek-OCR 2 to investigate the feasibility of a novel encoder, DeepEncoder V2, capable of dynamically reordering visual tokens based on image semantics. Conventional vision-language models (VLMs) invariably process visual tokens in a rigid raster-scan order (top-left to bottom-right) with fixed positional encoding when fed into LLMs. However, this contradicts human visual perception, which follows flexible yet semantically coherent scanning patterns driven by inherent logical structures. Particularly for images with complex layouts, human vision exhibits causally-informed sequential processing. Inspired by this cognitive mechanism, DeepEncoder V2 is designed to endow the encoder with causal reasoning capabilities, enabling it to intelligently reorder visual tokens prior to LLM-based content interpretation. This work explores a novel paradigm: whether 2D image understanding can be effectively achieved through two cascaded 1D causal reasoning structures, thereby offering a new architectural approach with the potential to achieve genuine 2D reasoning. Codes and model weights are publicly accessible at http://github.com/deepseek-ai/DeepSeek-OCR-2.
NA · Mar 24, 2013
Finite element approximations of the stochastic mean curvature flow of planar curves of graphs
Xiaobing Feng, Yukun Li, Andreas Prohl
This paper develops and analyzes a semi-discrete and a fully discrete finite element method for a one-dimensional quasilinear parabolic stochastic partial differential equation (SPDE) which describes the stochastic mean curvature flow for planar curves of graphs. To circumvent the difficulty caused by the low spatial regularity of the SPDE solution, a regularization procedure is first proposed to approximate the SPDE, and an error estimate for the regularized problem is derived. A semi-discrete finite element method and a space-time fully discrete method are then proposed to approximate the solution of the regularized SPDE problem. Strong convergence with rates is established for both the semi-discrete and the fully discrete methods. Computational experiments are provided to study the interplay of the geometric evolution and gradient-type noise.
NA · Nov 20, 2018
Strong convergence of a fully discrete finite element method for a class of semilinear stochastic partial differential equations with multiplicative noise
Xiaobing Feng, Yukun Li, Yi Zhang
This paper develops and analyzes a fully discrete finite element method for a class of semilinear stochastic partial differential equations (SPDEs) with multiplicative noise. The nonlinearity in the diffusion term of the SPDEs is assumed to be globally Lipschitz and the nonlinearity in the drift term is only assumed to satisfy a one-sided Lipschitz condition. The semilinear SPDEs considered in this paper are a direct generalization of the SODEs considered in [13]. There are several difficulties which need to be overcome for this generalization. First, the spatial discretization, which does not appear in the SODE case, adds an extra layer of difficulty. It turns out that a special discretization must be designed to guarantee certain properties for the numerical scheme and its stiffness matrix. In this paper we use a finite element interpolation technique to discretize the nonlinear drift term. Second, in order to prove the strong convergence of the proposed fully discrete finite element method, stability estimates for higher order moments of the $H^1$-seminorm of the numerical solution must be established, which are difficult and delicate. A judicious combination of the properties of the drift and diffusion terms and a nontrivial technique borrowed from [16] is used in this paper to achieve the goal. Finally, stability estimates for the second and higher order moments of the $L^2$-norm of the numerical solution are also difficult to obtain because the mass matrix may not be diagonally dominant. This is done by utilizing the interpolation theory and the higher moment estimates for the $H^1$-seminorm of the numerical solution. After overcoming these difficulties, it is proved that the proposed fully discrete finite element method is convergent in strong norms with nearly optimal rates of convergence.
NA · Mar 12, 2019
Fully Discrete Mixed Finite Element Methods for the Stochastic Cahn-Hilliard Equation with Gradient-type Multiplicative Noise
Xiaobing Feng, Yukun Li, Yi Zhang
This paper develops and analyzes some fully discrete mixed finite element methods for the stochastic Cahn-Hilliard equation with gradient-type multiplicative noise that is white in time and correlated in space. The stochastic Cahn-Hilliard equation is formally derived as a phase field formulation of the stochastically perturbed Hele-Shaw flow. The main result of this paper is to prove strong convergence with optimal rates for the proposed mixed finite element methods. To overcome the difficulty caused by the low regularity in time of the solution to the stochastic Cahn-Hilliard equation, the Hölder continuity in time with respect to various norms for the stochastic PDE solution is established, and it plays a crucial role in the error analysis. Numerical experiments are also provided to validate the theoretical results and to study the impact of noise on the Hele-Shaw flow as well as the interplay of the geometric evolution and gradient-type noise.
NA · Nov 13, 2018
Energy Conserving Galerkin Approximation of Two Dimensional Wave Equations with Random Coefficients
Ching-Shan Chou, Yukun Li, Dongbin Xiu
Wave propagation problems for heterogeneous media are known to have many applications in physics and engineering. Recently, there has been an increasing interest in stochastic effects due to the uncertainty, which may arise from impurities of the media. This work considers a two-dimensional wave equation with random coefficients which may be discontinuous in space. Generalized polynomial chaos method is used in conjunction with stochastic Galerkin approximation, and local discontinuous Galerkin method is used for spatial discretization. Our method is shown to be energy preserving in semi-discrete form as well as in fully discrete form, when leap-frog time discretization is used. Its convergence rate is proved to be optimal and the error grows linearly in time. The theoretical properties of the proposed scheme are validated by numerical tests.
CL · Feb 4
ERNIE 5.0 Technical Report
Haifeng Wang, Hua Wu, Tian Wu et al.
In this report, we introduce ERNIE 5.0, a natively autoregressive foundation model designed for unified multimodal understanding and generation across text, image, video, and audio. All modalities are trained from scratch under a unified next-group-of-tokens prediction objective, based on an ultra-sparse mixture-of-experts (MoE) architecture with modality-agnostic expert routing. To address practical challenges in large-scale deployment under diverse resource constraints, ERNIE 5.0 adopts a novel elastic training paradigm. Within a single pre-training run, the model learns a family of sub-models with varying depths, expert capacities, and routing sparsity, enabling flexible trade-offs among performance, model size, and inference latency in memory- or time-constrained scenarios. Moreover, we systematically address the challenges of scaling reinforcement learning to unified foundation models, thereby guaranteeing efficient and stable post-training under ultra-sparse MoE architectures and diverse multimodal settings. Extensive experiments demonstrate that ERNIE 5.0 achieves strong and balanced performance across multiple modalities. To the best of our knowledge, among publicly disclosed models, ERNIE 5.0 represents the first production-scale realization of a trillion-parameter unified autoregressive model that supports both multimodal understanding and generation. To facilitate further research, we present detailed visualizations of modality-agnostic expert routing in the unified model, alongside comprehensive empirical analysis of elastic training, aiming to offer profound insights to the community.
LG · Feb 26, 2023
A Survey on Uncertainty Quantification Methods for Deep Learning
Wenchong He, Zhe Jiang, Tingsong Xiao et al.
Deep neural networks (DNNs) have achieved tremendous success in computer vision, natural language processing, and scientific and engineering domains. However, DNNs can make unexpected, incorrect, yet overconfident predictions, leading to serious consequences in high-stakes applications such as autonomous driving, medical diagnosis, and disaster response. Uncertainty quantification (UQ) estimates the confidence of DNN predictions in addition to their accuracy. In recent years, many UQ methods have been developed for DNNs. It is valuable to systematically categorize these methods and compare their strengths and limitations. Existing surveys mostly categorize UQ methodologies by neural network architecture or Bayesian formulation, while overlooking the uncertainty sources each method addresses, making it difficult to select an appropriate approach in practice. To fill this gap, this paper presents a taxonomy of UQ methods for DNNs based on uncertainty sources (e.g., data versus model uncertainty). We summarize the advantages and disadvantages of each category, and illustrate how UQ can be applied to machine learning problems (e.g., active learning, out-of-distribution robustness, and deep reinforcement learning). We also identify future research directions, including UQ for large language models (LLMs), AI-driven scientific simulations, and deep neural networks with structured outputs.
CV · Dec 13, 2024 · Code
DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding
Zhiyu Wu, Xiaokang Chen, Zizheng Pan et al.
We present DeepSeek-VL2, an advanced series of large Mixture-of-Experts (MoE) Vision-Language Models that significantly improves upon its predecessor, DeepSeek-VL, through two major upgrades. For the vision component, we incorporate a dynamic tiling vision encoding strategy designed for processing high-resolution images with different aspect ratios. For the language component, we leverage DeepSeekMoE models with the Multi-head Latent Attention mechanism, which compresses the Key-Value cache into latent vectors, to enable efficient inference and high throughput. Trained on an improved vision-language dataset, DeepSeek-VL2 demonstrates superior capabilities across various tasks, including but not limited to visual question answering, optical character recognition, document/table/chart understanding, and visual grounding. Our model series is composed of three variants: DeepSeek-VL2-Tiny, DeepSeek-VL2-Small and DeepSeek-VL2, with 1.0B, 2.8B and 4.5B activated parameters respectively. DeepSeek-VL2 achieves competitive or state-of-the-art performance with similar or fewer activated parameters compared to existing open-source dense and MoE-based models. Codes and pre-trained models are publicly accessible at https://github.com/deepseek-ai/DeepSeek-VL2.
SE · Mar 16 · Code
daVinci-Env: Open SWE Environment Synthesis at Scale
Dayuan Fu, Shenyu Wu, Yunze Wu et al.
Training capable software engineering (SWE) agents demands large-scale, executable, and verifiable environments that provide dynamic feedback loops for iterative code editing, test execution, and solution refinement. However, existing open-source datasets remain limited in scale and repository diversity, while industrial solutions are opaque with unreleased infrastructure, creating a prohibitive barrier for most academic research groups. We present OpenSWE, the largest fully transparent framework for SWE agent training in Python, comprising 45,320 executable Docker environments spanning over 12.8k repositories, with all Dockerfiles, evaluation scripts, and infrastructure fully open-sourced for reproducibility. OpenSWE is built through a multi-agent synthesis pipeline deployed across a 64-node distributed cluster, automating repository exploration, Dockerfile construction, evaluation script generation, and iterative test analysis. Beyond scale, we propose a quality-centric filtering pipeline that characterizes the inherent difficulty of each environment, filtering out instances that are either unsolvable or insufficiently challenging and retaining only those that maximize learning efficiency. With $891K spent on environment construction and an additional $576K on trajectory sampling and difficulty-aware curation, the entire project represents a total investment of approximately $1.47 million, yielding about 13,000 curated trajectories from roughly 9,000 quality guaranteed environments. Extensive experiments validate OpenSWE's effectiveness: OpenSWE-32B and OpenSWE-72B achieve 62.4% and 66.0% on SWE-bench Verified, establishing SOTA among Qwen2.5 series. Moreover, SWE-focused training yields substantial out-of-domain improvements, including up to 12 points on mathematical reasoning and 5 points on science benchmarks, without degrading factual recall.
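The budget figures quoted in the abstract can be checked with a few lines of arithmetic. The sketch below uses only the numbers stated above; the per-trajectory and per-environment averages are derived here from the rounded counts ("about 13,000", "roughly 9,000") and are therefore approximate, not figures reported by the authors.

```python
# Figures quoted in the abstract (US dollars).
ENV_CONSTRUCTION_COST = 891_000
TRAJECTORY_COST = 576_000

# Rounded counts from the abstract; the averages below inherit that rounding.
APPROX_TRAJECTORIES = 13_000
APPROX_ENVIRONMENTS = 9_000


def total_cost() -> int:
    """Sum of the two spending items quoted in the abstract."""
    return ENV_CONSTRUCTION_COST + TRAJECTORY_COST


def average_cost(total: int, count: int) -> float:
    """Average spend per item (derived here, not reported by the authors)."""
    return total / count


if __name__ == "__main__":
    total = total_cost()
    print(f"total: ${total:,} (about ${total / 1e6:.2f} million)")
    print(f"per curated trajectory: ${average_cost(total, APPROX_TRAJECTORIES):.2f}")
    print(f"per retained environment: ${average_cost(total, APPROX_ENVIRONMENTS):.2f}")
```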
LG · Nov 19, 2022
NVDiff: Graph Generation through the Diffusion of Node Vectors
Xiaohui Chen, Yukun Li, Aonan Zhang et al.
Learning to generate graphs is challenging as a graph is a set of pairwise connected, unordered nodes encoding complex combinatorial structures. Recently, several works have proposed graph generative models based on normalizing flows or score-based diffusion models. However, these models need to generate nodes and edges in parallel from the same process, whose dimensionality is unnecessarily high. We propose NVDiff, which takes the VGAE structure and uses a score-based generative model (SGM) as a flexible prior to sample node vectors. By modeling only node vectors in the latent space, NVDiff significantly reduces the dimension of the diffusion process and thus improves sampling speed. Built on the NVDiff framework, we introduce an attention-based score network capable of capturing both local and global contexts of graphs. Experiments indicate that NVDiff significantly reduces computations and can model much larger graphs than competing methods. At the same time, it achieves superior or competitive performances over various datasets compared to previous methods.
CL · Jan 12
Conditional Memory via Scalable Lookup: A New Axis of Sparsity for Large Language Models
Xin Cheng, Wangding Zeng, Damai Dai et al.
While Mixture-of-Experts (MoE) scales capacity via conditional computation, Transformers lack a native primitive for knowledge lookup, forcing them to inefficiently simulate retrieval through computation. To address this, we introduce conditional memory as a complementary sparsity axis, instantiated via Engram, a module that modernizes classic $N$-gram embedding for $O(1)$ lookup. By formulating the Sparsity Allocation problem, we uncover a U-shaped scaling law that optimizes the trade-off between neural computation (MoE) and static memory (Engram). Guided by this law, we scale Engram to 27B parameters, achieving superior performance over a strictly iso-parameter and iso-FLOPs MoE baseline. Most notably, while the memory module is expected to aid knowledge retrieval (e.g., MMLU +3.4; CMMLU +4.0), we observe even larger gains in general reasoning (e.g., BBH +5.0; ARC-Challenge +3.7) and code/math domains (HumanEval +3.0; MATH +2.4). Mechanistic analyses reveal that Engram relieves the backbone's early layers from static reconstruction, effectively deepening the network for complex reasoning. Furthermore, by delegating local dependencies to lookups, it frees up attention capacity for global context, substantially boosting long-context retrieval (e.g., Multi-Query NIAH: 84.2 to 97.0). Finally, Engram establishes infrastructure-aware efficiency: its deterministic addressing enables runtime prefetching from host memory, incurring negligible overhead. We envision conditional memory as an indispensable modeling primitive for next-generation sparse models.
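The general idea of a constant-time $N$-gram embedding lookup with deterministic addressing can be sketched as a hashed table. This is not Engram's actual design, which the abstract does not specify; the class `NgramMemory`, the rolling hash in `ngram_slot`, and all sizes are hypothetical choices made for illustration.

```python
from typing import List, Sequence


def ngram_slot(ngram: Sequence[int], n_slots: int, base: int = 1_000_003) -> int:
    """Deterministic polynomial hash of a token-id n-gram into [0, n_slots).

    Hypothetical addressing scheme: the slot depends only on the n-gram,
    so it can be computed before the forward pass (for example, to prefetch).
    """
    h = 0
    for tok in ngram:
        h = (h * base + tok + 1) % n_slots
    return h


class NgramMemory:
    """Toy hashed n-gram embedding table with O(1) lookup per position."""

    def __init__(self, n_slots: int, dim: int, n: int = 2) -> None:
        self.n = n
        self.n_slots = n_slots
        # Embeddings start at zero here; a real model would learn them.
        self.table: List[List[float]] = [[0.0] * dim for _ in range(n_slots)]

    def lookup(self, token_ids: Sequence[int], pos: int) -> List[float]:
        """Return the embedding for the n-gram ending at position `pos`."""
        start = max(0, pos - self.n + 1)
        ngram = token_ids[start:pos + 1]
        return self.table[ngram_slot(ngram, self.n_slots)]


if __name__ == "__main__":
    memory = NgramMemory(n_slots=1024, dim=4, n=2)
    # Both sequences end in the bigram (7, 9), so they address the same slot.
    a = memory.lookup([5, 7, 9], pos=2)
    b = memory.lookup([1, 7, 9], pos=2)
    print(a is b)
```

Because the address is a pure function of the token ids, no computation on hidden states is needed to find the entry, which is the property the abstract credits for prefetching from host memory.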
NA · Aug 26, 2018
Analysis of the Morley element for the Cahn-Hilliard equation and the Hele-Shaw flow
Shuonan Wu, Yukun Li
The paper analyzes the Morley element method for the Cahn-Hilliard equation. The objective is to derive the optimal error estimates and to prove that the zero-level sets of the Cahn-Hilliard equation approximate the Hele-Shaw flow. If the piecewise $L^{\infty}(H^2)$ error bound is derived by choosing the test function directly, we cannot obtain the optimal error order, and we cannot establish an error bound which depends on $\frac{1}{\varepsilon}$ polynomially either. To overcome this difficulty, this paper proves them in the following steps, where the result of each step cannot be established without the result of the previous one. First, it proves some a priori estimates of the exact solution $u$, and these regularity results are minimal to get the main results. Second, it establishes ${L^{\infty}(L^2)}$ and piecewise ${L^2(H^2)}$ error bounds which depend on $\frac{1}{\varepsilon}$ polynomially, based on the piecewise ${L^{\infty}(H^{-1})}$ and ${L^2(H^1)}$ error bounds. Third, it establishes a piecewise ${L^{\infty}(H^2)}$ optimal error bound which depends on $\frac{1}{\varepsilon}$ polynomially, based on the piecewise ${L^{\infty}(L^2)}$ and ${L^2(H^2)}$ error bounds. Finally, it proves the ${L^\infty(L^\infty)}$ error bound and the approximation to the Hele-Shaw flow based on the piecewise ${L^{\infty}(H^2)}$ error bound. Nonstandard techniques are used in these steps, such as the generalized coercivity result, integration by parts in space, summation by parts in time, and special properties of the Morley elements. If any one of these techniques is missing, either we can only obtain the sub-optimal piecewise ${L^{\infty}(H^2)}$ error order, or we can merely obtain error bounds which are exponentially dependent on $\frac{1}{\varepsilon}$. Numerical results are presented to validate the optimal $L^\infty(H^2)$ error order and the asymptotic behavior of the solutions of the Cahn-Hilliard equation.
AI · May 4 · Code
AcademiClaw: When Students Set Challenges for AI Agents
Junjie Yu, Pengrui Lu, Weiye Si et al.
Benchmarks within the OpenClaw ecosystem have thus far evaluated exclusively assistant-level tasks, leaving the academic-level capabilities of OpenClaw largely unexamined. We introduce AcademiClaw, a bilingual benchmark of 80 complex, long-horizon tasks sourced directly from university students' real academic workflows -- homework, research projects, competitions, and personal projects -- that they found current AI agents unable to solve effectively. Curated from 230 student-submitted candidates through rigorous expert review, the final task set spans 25+ professional domains, ranging from olympiad-level mathematics and linguistics problems to GPU-intensive reinforcement learning and full-stack system debugging, with 16 tasks requiring CUDA GPU execution. Each task executes in an isolated Docker sandbox and is scored on task completion by multi-dimensional rubrics combining six complementary techniques, with an independent five-category safety audit providing additional behavioral analysis. Experiments on six frontier models show that even the best achieves only a 55% pass rate. Further analysis uncovers sharp capability boundaries across task domains, divergent behavioral strategies among models, and a disconnect between token consumption and output quality, providing fine-grained diagnostic signals beyond what aggregate metrics reveal. We hope that AcademiClaw and its open-sourced data and code can serve as a useful resource for the OpenClaw community, driving progress toward agents that are more capable and versatile across the full breadth of real-world academic demands. All data and code are available at https://github.com/GAIR-NLP/AcademiClaw.
CL · Dec 2, 2025
DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models
DeepSeek-AI, Aixin Liu, Aoxue Mei et al.
We introduce DeepSeek-V3.2, a model that harmonizes high computational efficiency with superior reasoning and agent performance. The key technical breakthroughs of DeepSeek-V3.2 are as follows: (1) DeepSeek Sparse Attention (DSA): We introduce DSA, an efficient attention mechanism that substantially reduces computational complexity while preserving model performance in long-context scenarios. (2) Scalable Reinforcement Learning Framework: By implementing a robust reinforcement learning protocol and scaling post-training compute, DeepSeek-V3.2 performs comparably to GPT-5. Notably, our high-compute variant, DeepSeek-V3.2-Speciale, surpasses GPT-5 and exhibits reasoning proficiency on par with Gemini-3.0-Pro, achieving gold-medal performance in both the 2025 International Mathematical Olympiad (IMO) and the International Olympiad in Informatics (IOI). (3) Large-Scale Agentic Task Synthesis Pipeline: To integrate reasoning into tool-use scenarios, we developed a novel synthesis pipeline that systematically generates training data at scale. This methodology facilitates scalable agentic post-training, yielding substantial improvements in generalization and instruction-following robustness within complex, interactive environments.
CV · Oct 21, 2025 · Code
DeepSeek-OCR: Contexts Optical Compression
Haoran Wei, Yaofeng Sun, Yukun Li
We present DeepSeek-OCR as an initial investigation into the feasibility of compressing long contexts via optical 2D mapping. DeepSeek-OCR consists of two components: DeepEncoder and DeepSeek3B-MoE-A570M as the decoder. Specifically, DeepEncoder serves as the core engine, designed to maintain low activations under high-resolution input while achieving high compression ratios to ensure an optimal and manageable number of vision tokens. Experiments show that when the number of text tokens is within 10 times that of vision tokens (i.e., a compression ratio < 10x), the model can achieve decoding (OCR) precision of 97%. Even at a compression ratio of 20x, the OCR accuracy still remains at about 60%. This shows considerable promise for research areas such as historical long-context compression and memory forgetting mechanisms in LLMs. Beyond this, DeepSeek-OCR also demonstrates high practical value. On OmniDocBench, it surpasses GOT-OCR2.0 (256 tokens/page) using only 100 vision tokens, and outperforms MinerU2.0 (6000+ tokens per page on average) while utilizing fewer than 800 vision tokens. In production, DeepSeek-OCR can generate training data for LLMs/VLMs at a scale of 200k+ pages per day (a single A100-40G). Codes and model weights are publicly accessible at http://github.com/deepseek-ai/DeepSeek-OCR.
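The compression ratio used in the abstract is the number of text tokens divided by the number of vision tokens. The small sketch below makes that arithmetic explicit; the function names are hypothetical, and the code only computes ratios and token budgets, not the reported precision figures.

```python
def compression_ratio(n_text_tokens: int, n_vision_tokens: int) -> float:
    """Text tokens represented per vision token."""
    if n_vision_tokens <= 0:
        raise ValueError("n_vision_tokens must be positive")
    return n_text_tokens / n_vision_tokens


def min_vision_tokens(n_text_tokens: int, max_ratio: float) -> int:
    """Smallest vision-token count whose ratio is strictly below max_ratio."""
    n = 1
    while compression_ratio(n_text_tokens, n) >= max_ratio:
        n += 1
    return n


if __name__ == "__main__":
    # A page of 600 text tokens encoded with 100 vision tokens.
    print(compression_ratio(600, 100))
    # Vision tokens needed to keep a 1000-token page strictly below 10x.
    print(min_vision_tokens(1000, 10.0))
```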
SE · Jun 17, 2024 · Code
DeepSeek-Coder-V2: Breaking the Barrier of Closed-Source Models in Code Intelligence
DeepSeek-AI, Qihao Zhu, Daya Guo et al.
We present DeepSeek-Coder-V2, an open-source Mixture-of-Experts (MoE) code language model that achieves performance comparable to GPT4-Turbo in code-specific tasks. Specifically, DeepSeek-Coder-V2 is further pre-trained from an intermediate checkpoint of DeepSeek-V2 with additional 6 trillion tokens. Through this continued pre-training, DeepSeek-Coder-V2 substantially enhances the coding and mathematical reasoning capabilities of DeepSeek-V2, while maintaining comparable performance in general language tasks. Compared to DeepSeek-Coder-33B, DeepSeek-Coder-V2 demonstrates significant advancements in various aspects of code-related tasks, as well as reasoning and general capabilities. Additionally, DeepSeek-Coder-V2 expands its support for programming languages from 86 to 338, while extending the context length from 16K to 128K. In standard benchmark evaluations, DeepSeek-Coder-V2 achieves superior performance compared to closed-source models such as GPT4-Turbo, Claude 3 Opus, and Gemini 1.5 Pro in coding and math benchmarks.
CL · Jul 29, 2019 · Code
ERNIE 2.0: A Continual Pre-training Framework for Language Understanding
Yu Sun, Shuohuan Wang, Yukun Li et al.
Recently, pre-trained models have achieved state-of-the-art results in various language understanding tasks, which indicates that pre-training on large-scale corpora may play a crucial role in natural language processing. Current pre-training procedures usually focus on training the model with several simple tasks to grasp the co-occurrence of words or sentences. However, besides co-occurrence, there is other valuable lexical, syntactic and semantic information in training corpora, such as named entities, semantic closeness and discourse relations. In order to extract the lexical, syntactic and semantic information from training corpora to the fullest extent, we propose a continual pre-training framework named ERNIE 2.0, which incrementally builds and learns pre-training tasks through constant multi-task learning. Experimental results demonstrate that ERNIE 2.0 outperforms BERT and XLNet on 16 tasks including English tasks on GLUE benchmarks and several common tasks in Chinese. The source codes and pre-trained models have been released at https://github.com/PaddlePaddle/ERNIE.
CV · Mar 15, 2024
CoLeCLIP: Open-Domain Continual Learning via Joint Task Prompt and Vocabulary Learning
Yukun Li, Guansong Pang, Wei Suo et al.
This paper explores the problem of continual learning (CL) of vision-language models (VLMs) in open domains, where the models need to perform continual updating and inference on a streaming of datasets from diverse seen and unseen domains with novel classes. Such a capability is crucial for various applications in open environments, e.g., AI assistants, autonomous driving systems, and robotics. Current CL studies mostly focus on closed-set scenarios in a single domain with known classes. Large pre-trained VLMs like CLIP have demonstrated superior zero-shot recognition ability, and a number of recent studies leverage this ability to mitigate catastrophic forgetting in CL, but they focus on closed-set CL in a single domain dataset. Open-domain CL of large VLMs is significantly more challenging due to 1) large class correlations and domain gaps across the datasets and 2) the forgetting of zero-shot knowledge in the pre-trained VLMs in addition to the knowledge learned from the newly adapted datasets. In this work we introduce a novel approach, termed CoLeCLIP, that learns an open-domain CL model based on CLIP. It addresses these challenges by a joint learning of a set of task prompts and a cross-domain class vocabulary. Extensive experiments on 11 domain datasets show that CoLeCLIP outperforms state-of-the-art methods for open-domain CL under both task- and class-incremental learning settings.
CL · Nov 3, 2024
Graph-based Confidence Calibration for Large Language Models
Yukun Li, Sijia Wang, Lifu Huang et al.
Reliable confidence estimation is essential for enhancing the trustworthiness of large language models (LLMs), especially in high-stakes scenarios. Despite its importance, accurately estimating confidence in LLM responses remains a significant challenge. In this work, we propose using an auxiliary learning model to assess response correctness based on the self-consistency of multiple outputs generated by the LLM. Our method builds a consistency graph to represent the agreement among multiple responses and uses a graph neural network (GNN) to estimate the likelihood that each response is correct. Experiments demonstrate that this method has strong calibration performance on various benchmark datasets and generalizes well to out-of-domain cases.
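A consistency graph over sampled responses can be sketched as a weighted adjacency matrix. The code below is a simplified stand-in under loud assumptions: agreement is measured by token-level Jaccard overlap (the abstract does not say how agreement is computed), and the trained GNN is replaced by a plain average-agreement score per node. It is a baseline for intuition, not the authors' method.

```python
from itertools import combinations
from typing import List


def jaccard(a: str, b: str) -> float:
    """Token-set overlap between two responses (assumed agreement measure)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def consistency_graph(responses: List[str]) -> List[List[float]]:
    """Symmetric weight matrix; entry [i][j] is the agreement of i and j."""
    n = len(responses)
    weights = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        weights[i][j] = weights[j][i] = jaccard(responses[i], responses[j])
    return weights


def degree_confidence(weights: List[List[float]]) -> List[float]:
    """Mean agreement of each response with the others.

    Stand-in for the GNN in the abstract: no learning, only node degree.
    """
    n = len(weights)
    if n < 2:
        return [0.0] * n
    return [sum(row) / (n - 1) for row in weights]


if __name__ == "__main__":
    samples = ["the answer is 42", "the answer is 42", "it is 7"]
    graph = consistency_graph(samples)
    for text, score in zip(samples, degree_confidence(graph)):
        print(f"{score:.3f}  {text}")
```

A learned model would replace `degree_confidence` and be trained against correctness labels; the degree score only captures the intuition that responses agreeing with many others are more likely to be correct.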
CV · Apr 3, 2024
Enhancing Diffusion-based Point Cloud Generation with Smoothness Constraint
Yukun Li, Liping Liu
Diffusion models have been popular for point cloud generation tasks. Existing works utilize the forward diffusion process to convert the original point distribution into a noise distribution and then learn the reverse diffusion process to recover the point distribution from the noise distribution. However, the reverse diffusion process can produce samples with non-smooth points on the surface because it ignores the geometric properties of the point cloud. We propose alleviating the problem by incorporating a local smoothness constraint into the diffusion framework for point cloud generation. Experiments demonstrate that the proposed model can generate realistic shapes and smoother point clouds, outperforming multiple state-of-the-art methods.
CV · Apr 10
Hitem3D 2.0: Multi-View Guided Native 3D Texture Generation
Huiang He, Shengchu Zhao, Jianwen Huang et al.
Although recent advances have improved the quality of 3D texture generation, existing methods still struggle with incomplete texture coverage, cross-view inconsistency, and misalignment between geometry and texture. To address these limitations, we propose Hitem3D 2.0, a multi-view guided native 3D texture generation framework that enhances texture quality through the integration of 2D multi-view generation priors and native 3D texture representations. Hitem3D 2.0 comprises two key components: a multi-view synthesis framework and a native 3D texture generation model. The multi-view generation is built upon a pre-trained image editing backbone and incorporates plug-and-play modules that explicitly promote geometric alignment, cross-view consistency, and illumination uniformity, thereby enabling the synthesis of high-fidelity multi-view images. Conditioned on the generated views and 3D geometry, the native 3D texture generation model projects multi-view textures onto 3D surfaces while plausibly completing textures in unseen regions. Through the integration of multi-view consistency constraints with native 3D texture modeling, Hitem3D 2.0 significantly improves texture completeness, cross-view coherence, and geometric alignment. Experimental results demonstrate that Hitem3D 2.0 outperforms existing methods in terms of texture detail, fidelity, consistency, coherence, and alignment.
LG · May 2, 2025
Incorporating Inductive Biases to Energy-based Generative Models
Yukun Li, Li-Ping Liu
With the advent of score-matching techniques for model training and Langevin dynamics for sample generation, energy-based models (EBMs) have gained renewed interest as generative models. Recent EBMs usually use neural networks to define their energy functions. In this work, we introduce a novel hybrid approach that combines an EBM with an exponential family model to incorporate inductive bias into data modeling. Specifically, we augment the energy term with a parameter-free statistic function to help the model capture key data statistics. Like an exponential family model, the hybrid model aims to align the distribution statistics with data statistics during model training, even when it only approximately maximizes the data likelihood. This property enables us to impose constraints on the hybrid model. Our empirical study validates the hybrid model's ability to match statistics. Furthermore, experimental results show that data fitting and generation improve when suitable informative statistics are incorporated into the hybrid model.
CV Jul 27, 2021
Transferable Knowledge-Based Multi-Granularity Aggregation Network for Temporal Action Localization: Submission to ActivityNet Challenge 2021
Haisheng Su, Peiqin Zhuang, Yukun Li et al.
This technical report presents an overview of our solution submitted to the 2021 HACS Temporal Action Localization Challenge on both the Supervised Learning Track and the Weakly-Supervised Learning Track. Temporal Action Localization (TAL) requires not only precisely locating the temporal boundaries of action instances, but also accurately classifying the untrimmed videos into specific categories. Weakly-Supervised TAL (WSTAL), by contrast, must locate the action instances using only video-level class labels. In this paper, to train a supervised temporal action localizer, we adopt the Temporal Context Aggregation Network (TCANet) to generate high-quality action proposals through "local and global" temporal context aggregation and complementary as well as progressive boundary refinement. For WSTAL, a novel framework is proposed to handle the poor quality of the class activation sequence (CAS) generated by a simple classification network, which can only focus on local discriminative parts rather than locate the entire interval of target actions. Further inspired by transfer learning, we also adopt an additional module to transfer knowledge from trimmed videos (HACS Clips dataset) to untrimmed videos (HACS Segments dataset), aiming to improve classification performance on untrimmed videos. Finally, we employ a boundary regression module embedded with an Outer-Inner-Contrastive (OIC) loss to automatically predict the boundaries based on the enhanced CAS. Our proposed scheme achieves 39.91 and 29.78 average mAP on the challenge testing set of the supervised and weakly-supervised temporal action localization tracks, respectively.
IR Jun 7, 2021
Pre-trained Language Model for Web-scale Retrieval in Baidu Search
Yiding Liu, Guan Huang, Jiaxiang Liu et al.
Retrieval is a crucial stage in web search that identifies a small set of query-relevant candidates from a billion-scale corpus. Discovering more semantically related candidates in the retrieval stage is a promising way to expose more high-quality results to end users. However, building and deploying effective retrieval models for semantic matching in a real search engine remains a non-trivial challenge. In this paper, we describe the retrieval system that we developed and deployed in Baidu Search. The system exploits the recent state-of-the-art Chinese pretrained language model, namely Enhanced Representation through kNowledge IntEgration (ERNIE), which provides the system with expressive semantic matching. In particular, we developed an ERNIE-based retrieval model, which is equipped with 1) expressive Transformer-based semantic encoders, and 2) a comprehensive multi-stage training paradigm. More importantly, we present a practical system workflow for deploying the model in web-scale retrieval. The system has been fully deployed into production, where rigorous offline and online experiments were conducted. The results show that the system can perform high-quality candidate retrieval, especially for tail queries with uncommon demands. Overall, the new retrieval system facilitated by the pretrained language model (i.e., ERNIE) can largely improve the usability and applicability of our search engine.
CL Jan 26, 2020
ERNIE-GEN: An Enhanced Multi-Flow Pre-training and Fine-tuning Framework for Natural Language Generation
Dongling Xiao, Han Zhang, Yukun Li et al.
Current pre-training work in natural language generation pays little attention to the problem of exposure bias on downstream tasks. To address this issue, we propose an enhanced multi-flow sequence-to-sequence pre-training and fine-tuning framework named ERNIE-GEN, which bridges the discrepancy between training and inference with an infilling generation mechanism and a noise-aware generation method. To make generation closer to human writing patterns, this framework introduces a span-by-span generation flow that trains the model to predict semantically complete spans consecutively rather than predicting word by word. Unlike existing pre-training methods, ERNIE-GEN incorporates multi-granularity target sampling to construct pre-training data, which enhances the correlation between encoder and decoder. Experimental results demonstrate that ERNIE-GEN achieves state-of-the-art results with a much smaller amount of pre-training data and parameters on a range of language generation tasks, including abstractive summarization (Gigaword and CNN/DailyMail), question generation (SQuAD), dialogue generation (Persona-Chat) and generative question answering (CoQA).
CL Apr 19, 2019
ERNIE: Enhanced Representation through Knowledge Integration
Yu Sun, Shuohuan Wang, Yukun Li et al.
We present a novel language representation model enhanced by knowledge called ERNIE (Enhanced Representation through kNowledge IntEgration). Inspired by the masking strategy of BERT, ERNIE is designed to learn language representation enhanced by knowledge masking strategies, which include entity-level masking and phrase-level masking. The entity-level strategy masks entities, which are usually composed of multiple words. The phrase-level strategy masks a whole phrase, which is composed of several words standing together as a conceptual unit. Experimental results show that ERNIE outperforms other baseline methods, achieving new state-of-the-art results on five Chinese natural language processing tasks including natural language inference, semantic similarity, named entity recognition, sentiment analysis and question answering. We also demonstrate that ERNIE has more powerful knowledge inference capacity on a cloze test.
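The difference between token-level masking and the entity-level or phrase-level masking described above can be illustrated with a small sketch. This is not the ERNIE implementation: the function name `mask_spans`, the `spans` input format (half-open index pairs, assumed to come from some external entity or phrase tagger), and the `mask_prob` default are all hypothetical choices made for illustration.

```python
import random

MASK = "[MASK]"

def mask_spans(tokens, spans, mask_prob=0.15, rng=None):
    """Mask whole spans (entities or phrases) rather than single tokens.

    tokens: list of str
    spans:  list of (start, end) half-open index pairs, one per entity or
            phrase; assumed to be supplied by an external tagger
    mask_prob: probability that each span is masked as a unit (illustrative)
    """
    rng = rng or random.Random()
    masked = list(tokens)
    for start, end in spans:
        # Decide once per span, then mask every token inside it together
        if rng.random() < mask_prob:
            for i in range(start, end):
                masked[i] = MASK
    return masked

if __name__ == "__main__":
    tokens = ["Harry", "Potter", "is", "a", "series", "of", "fantasy", "novels"]
    spans = [(0, 2), (3, 5)]  # one entity span and one phrase span
    print(mask_spans(tokens, spans, mask_prob=0.5, rng=random.Random(0)))
```

Because the masking decision is made per span, the model never sees half of an entity as context when predicting the other half, which is the intuition behind the knowledge masking strategies named in the abstract.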
NA Sep 15, 2018
Error analysis of a fully discrete Morley finite element approximation for the Cahn-Hilliard equation
Yukun Li
This paper proposes and analyzes the Morley element method for the Cahn-Hilliard equation. It is a fourth-order nonlinear singular perturbation equation that arises from the binary alloy problem in materials science, and its limit is proved to approach the Hele-Shaw flow. If the $L^2(\Omega)$ error estimate is considered directly as in \cite{elliott1989nonconforming}, we can only prove that the error bound depends exponentially on $\frac{1}{\epsilon}$. Instead, this paper derives an error bound that depends polynomially on $\frac{1}{\epsilon}$ by considering the discrete $H^{-1}$ error estimate first. There are two main difficulties in proving this polynomial dependence of the discrete $H^{-1}$ error estimate. Firstly, it is difficult to prove the discrete energy law and discrete stability results due to the complex structure of the bilinear form of the Morley element discretization. This paper overcomes this difficulty by defining four types of discrete inverse Laplace operators and exploring the relations between these discrete inverse Laplace operators and the continuous inverse Laplace operator. Each of these operators plays an important role, and their relations are crucial in proving the discrete energy law, discrete stability results and error estimates. Secondly, it is difficult to prove the discrete spectrum estimate in the Morley element space because the Morley element space intersects with the $C^1$ conforming finite element space but neither is contained in the other. Instead of proving this discrete spectrum estimate in the Morley element space, this paper proves a generalized coercivity result by exploring properties of the enriching operators and using the discrete spectrum estimate in its $C^1$ conforming relative finite element space, which can be obtained by using the spectrum estimate of the Cahn-Hilliard operator.
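For orientation, one common scaling of the Cahn-Hilliard equation is sketched below. The abstract does not state the equation, so the scaling of $\epsilon$, the double-well nonlinearity $f(u) = u^3 - u$, and the omitted boundary and initial conditions are assumptions; the paper's own formulation may differ.

```latex
% A common form of the Cahn-Hilliard equation (scaling assumed)
u_t + \Delta\Big(\epsilon\,\Delta u - \frac{1}{\epsilon} f(u)\Big) = 0
\quad \text{in } \Omega \times (0,T],
\qquad f(u) = u^3 - u .
```

In this form the factor $\frac{1}{\epsilon}$ multiplying the nonlinearity shows why a direct Gronwall-type argument tends to produce error constants that grow exponentially in $\frac{1}{\epsilon}$, and why a polynomial dependence requires a spectrum estimate instead.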