CLJun 4, 2025Code
Seed-Coder: Let the Code Model Curate Data for ItselfByteDance Seed, Yuyu Zhang, Jing Su et al. · bytedance
Code data in large language model (LLM) pretraining is recognized crucial not only for code-related tasks but also for enhancing general intelligence of LLMs. Current open-source LLMs often heavily rely on human effort to produce their code pretraining data, such as employing hand-crafted filtering rules tailored to individual programming languages, or using human-annotated data to train quality filters. However, these approaches are inherently limited in scalability, prone to subjective biases, and costly to extend and maintain across diverse programming languages. To address these challenges, we introduce Seed-Coder, a series of open-source LLMs comprising base, instruct and reasoning models of 8B size, minimizing human involvement in data construction. Our code pretraining data is produced by a model-centric data pipeline, which predominantly leverages LLMs for scoring and filtering code data. The instruct model is further trained via supervised fine-tuning and preference optimization, and the reasoning model leverages Long-Chain-of-Thought (LongCoT) reinforcement learning to improve multi-step code reasoning. Seed-Coder achieves state-of-the-art results among open-source models of similar size and even surpasses some much larger models, demonstrating superior performance in code generation, code completion, code editing, code reasoning, and software engineering tasks.
CVDec 21, 2022Code
Low-Light Image and Video Enhancement: A Comprehensive Survey and BeyondShen Zheng, Yiling Ma, Jinqian Pan et al.
This paper presents a comprehensive survey of low-light image and video enhancement, addressing two primary challenges in the field. The first challenge is the prevalence of mixed over-/under-exposed images, which are not adequately addressed by existing methods. In response, this work introduces two enhanced variants of the SICE dataset: SICE_Grad and SICE_Mix, designed to better represent these complexities. The second challenge is the scarcity of suitable low-light video datasets for training and testing. To address this, the paper introduces the Night Wenzhou dataset, a large-scale, high-resolution video collection that features challenging fast-moving aerial scenes and streetscapes with varied illuminations and degradation. This study also conducts an extensive analysis of key techniques and performs comparative experiments using the proposed and current benchmark datasets. The survey concludes by highlighting emerging applications, discussing unresolved challenges, and suggesting future research directions within the LLIE community. The datasets are available at https://github.com/ShenZheng2000/LLIE_Survey.
CVNov 1, 2023Code
TPSeNCE: Towards Artifact-Free Realistic Rain Generation for Deraining and Object Detection in RainShen Zheng, Changjie Lu, Srinivasa G. Narasimhan
Rain generation algorithms have the potential to improve the generalization of deraining methods and scene understanding in rainy conditions. However, in practice, they produce artifacts and distortions and struggle to control the amount of rain generated due to a lack of proper constraints. In this paper, we propose an unpaired image-to-image translation framework for generating realistic rainy images. We first introduce a Triangular Probability Similarity (TPS) constraint to guide the generated images toward clear and rainy images in the discriminator manifold, thereby minimizing artifacts and distortions during rain generation. Unlike conventional contrastive learning approaches, which indiscriminately push negative samples away from the anchors, we propose a Semantic Noise Contrastive Estimation (SeNCE) strategy and reassess the pushing force of negative samples based on the semantic similarity between the clear and the rainy images and the feature similarity between the anchor and the negative samples. Experiments demonstrate realistic rain generation with minimal artifacts and distortions, which benefits image deraining and object detection in rain. Furthermore, the method can be used to generate realistic snowy and night images, underscoring its potential for broader applicability. Code is available at https://github.com/ShenZheng2000/TPSeNCE.
AINov 30, 2024
FullStack Bench: Evaluating LLMs as Full Stack CodersBytedance-Seed-Foundation-Code-Team, Yao Cheng, Jianfeng Chen et al. · bytedance
As the capabilities of code large language models (LLMs) continue to expand, their applications across diverse code intelligence domains are rapidly increasing. However, most existing datasets only evaluate limited application domains. To address this gap, we have developed a comprehensive code evaluation dataset FullStack Bench focusing on full-stack programming, which encompasses a wide range of application domains (e.g., basic programming, data analysis, software engineering, mathematics, and machine learning). Besides, to assess multilingual programming capabilities, in FullStack Bench, we design real-world instructions and corresponding unit test cases from 16 widely-used programming languages to reflect real-world usage scenarios rather than simple translations. Moreover, we also release an effective code sandbox execution tool (i.e., SandboxFusion) supporting various programming languages and packages to evaluate the performance of our FullStack Bench efficiently. Comprehensive experimental results on our FullStack Bench demonstrate the necessity and effectiveness of our FullStack Bench and SandboxFusion.
CVJul 13, 2022Code
PointNorm: Dual Normalization is All You Need for Point Cloud AnalysisShen Zheng, Jinqian Pan, Changjie Lu et al.
Point cloud analysis is challenging due to the irregularity of the point cloud data structure. Existing works typically employ the ad-hoc sampling-grouping operation of PointNet++, followed by sophisticated local and/or global feature extractors for leveraging the 3D geometry of the point cloud. Unfortunately, the sampling-grouping operations do not address the point cloud's irregularity, whereas the intricate local and/or global feature extractors led to poor computational efficiency. In this paper, we introduce a novel DualNorm module after the sampling-grouping operation to effectively and efficiently address the irregularity issue. The DualNorm module consists of Point Normalization, which normalizes the grouped points to the sampled points, and Reverse Point Normalization, which normalizes the sampled points to the grouped points. The proposed framework, PointNorm, utilizes local mean and global standard deviation to benefit from both local and global features while maintaining a faithful inference speed. Experiments show that we achieved excellent accuracy and efficiency on ModelNet40 classification, ScanObjectNN classification, ShapeNetPart Part Segmentation, and S3DIS Semantic Segmentation. Code is available at https://github.com/ShenZheng2000/PointNorm-for-Point-Cloud-Analysis.
IVApr 20, 2022Code
Unsupervised Domain Adaptation for Cardiac Segmentation: Towards Structure Mutual Information MaximizationChangjie Lu, Shen Zheng, Gaurav Gupta
Unsupervised domain adaptation approaches have recently succeeded in various medical image segmentation tasks. The reported works often tackle the domain shift problem by aligning the domain-invariant features and minimizing the domain-specific discrepancies. That strategy works well when the difference between a specific domain and between different domains is slight. However, the generalization ability of these models on diverse imaging modalities remains a significant challenge. This paper introduces UDA-VAE++, an unsupervised domain adaptation framework for cardiac segmentation with a compact loss function lower bound. To estimate this new lower bound, we develop a novel Structure Mutual Information Estimation (SMIE) block with a global estimator, a local estimator, and a prior information matching estimator to maximize the mutual information between the reconstruction and segmentation tasks. Specifically, we design a novel sequential reparameterization scheme that enables information flow and variance correction from the low-resolution latent space to the high-resolution latent space. Comprehensive experiments on benchmark cardiac segmentation datasets demonstrate that our model outperforms previous state-of-the-art qualitatively and quantitatively. The code is available at https://github.com/LOUEY233/Toward-Mutual-Information}{https://github.com/LOUEY233/Toward-Mutual-Information
CLSep 28, 2023Code
GPT-Fathom: Benchmarking Large Language Models to Decipher the Evolutionary Path towards GPT-4 and BeyondShen Zheng, Yuyu Zhang, Yijie Zhu et al.
With the rapid advancement of large language models (LLMs), there is a pressing need for a comprehensive evaluation suite to assess their capabilities and limitations. Existing LLM leaderboards often reference scores reported in other papers without consistent settings and prompts, which may inadvertently encourage cherry-picking favored settings and prompts for better results. In this work, we introduce GPT-Fathom, an open-source and reproducible LLM evaluation suite built on top of OpenAI Evals. We systematically evaluate 10+ leading LLMs as well as OpenAI's legacy models on 20+ curated benchmarks across 7 capability categories, all under aligned settings. Our retrospective study on OpenAI's earlier models offers valuable insights into the evolutionary path from GPT-3 to GPT-4. Currently, the community is eager to know how GPT-3 progressively improves to GPT-4, including technical details like whether adding code data improves LLM's reasoning capability, which aspects of LLM capability can be improved by SFT and RLHF, how much is the alignment tax, etc. Our analysis sheds light on many of these questions, aiming to improve the transparency of advanced LLMs.
CLApr 20, 2023
Why Does ChatGPT Fall Short in Providing Truthful Answers?Shen Zheng, Jie Huang, Kevin Chen-Chuan Chang
Recent advancements in large language models, such as ChatGPT, have demonstrated significant potential to impact various aspects of human life. However, ChatGPT still faces challenges in providing reliable and accurate answers to user questions. To better understand the model's particular weaknesses in providing truthful answers, we embark an in-depth exploration of open-domain question answering. Specifically, we undertake a detailed examination of ChatGPT's failures, categorized into: comprehension, factuality, specificity, and inference. We further pinpoint factuality as the most contributing failure and identify two critical abilities associated with factuality: knowledge memorization and knowledge recall. Through experiments focusing on factuality, we propose several potential enhancement strategies. Our findings suggest that augmenting the model with granular external knowledge and cues for knowledge recall can enhance the model's factuality in answering questions.
IVJun 28, 2022
AS-IntroVAE: Adversarial Similarity Distance Makes Robust IntroVAEChangjie Lu, Shen Zheng, Zirui Wang et al.
Recently, introspective models like IntroVAE and S-IntroVAE have excelled in image generation and reconstruction tasks. The principal characteristic of introspective models is the adversarial learning of VAE, where the encoder attempts to distinguish between the real and the fake (i.e., synthesized) images. However, due to the unavailability of an effective metric to evaluate the difference between the real and the fake images, the posterior collapse and the vanishing gradient problem still exist, reducing the fidelity of the synthesized images. In this paper, we propose a new variation of IntroVAE called Adversarial Similarity Distance Introspective Variational Autoencoder (AS-IntroVAE). We theoretically analyze the vanishing gradient problem and construct a new Adversarial Similarity Distance (AS-Distance) using the 2-Wasserstein distance and the kernel trick. With weight annealing on AS-Distance and KL-Divergence, the AS-IntroVAE are able to generate stable and high-quality images. The posterior collapse problem is addressed by making per-batch attempts to transform the image so that it better fits the prior distribution in the latent space. Compared with the per-image approach, this strategy fosters more diverse distributions in the latent space, allowing our model to produce images of great diversity. Comprehensive experiments on benchmark datasets demonstrate the effectiveness of AS-IntroVAE on image generation and reconstruction tasks.
AIFeb 5, 2025Code
BFS-Prover: Scalable Best-First Tree Search for LLM-based Automatic Theorem ProvingRan Xin, Chenguang Xi, Jie Yang et al.
Recent advancements in large language models (LLMs) have spurred growing interest in automatic theorem proving using Lean4, where effective tree search methods are crucial for navigating the underlying large proof search spaces. While the existing approaches primarily rely on value functions and/or Monte Carlo Tree Search (MCTS), the potential of simpler methods like Best-First Tree Search (BFS) remains underexplored. In this paper, we investigate whether BFS can achieve competitive performance in large-scale theorem proving tasks. We present BFS-Prover, a scalable expert iteration framework, featuring three key innovations. First, we implement strategic data filtering at each expert iteration round, excluding problems solvable via beam search node expansion to focus on harder cases. Second, we improve the sample efficiency of BFS through Direct Preference Optimization (DPO) applied to state-tactic pairs automatically annotated with compiler error feedback, refining the LLM's policy to prioritize productive expansions. Third, we employ length normalization in BFS to encourage exploration of deeper proof paths. BFS-Prover achieves a state-of-the-art score of $72.95\%$ on the MiniF2F test set and therefore challenges the perceived necessity of complex tree search methods, demonstrating that BFS can achieve competitive performance when properly scaled. To facilitate further research and development in this area, we have open-sourced our model at https://huggingface.co/ByteDance-Seed/BFS-Prover-V1-7B.
CVMar 19, 2024Code
Instance-Warp: Saliency Guided Image Warping for Unsupervised Domain AdaptationShen Zheng, Anurag Ghosh, Srinivasa G. Narasimhan
Driving is challenging in conditions like night, rain, and snow. Lack of good labeled datasets has hampered progress in scene understanding under such conditions. Unsupervised Domain Adaptation (UDA) using large labeled clear-day datasets is a promising research direction in such cases. However, many UDA methods are trained with dominant scene backgrounds (e.g., roads, sky, sidewalks) that appear dramatically different across domains. As a result, they struggle to learn effective features of smaller and often sparse foreground objects (e.g., people, vehicles, signs). In this work, we improve UDA training by applying in-place image warping to focus on salient objects. We design instance-level saliency guidance to adaptively oversample object regions and undersample background areas, which reduces adverse effects from background context and enhances backbone feature learning. Our approach improves adaptation across geographies, lighting, and weather conditions, and is agnostic to the task (segmentation, detection), domain adaptation algorithm, saliency guidance, and underlying model architecture. Result highlights include +6.1 mAP50 for BDD100K Clear $\rightarrow$ DENSE Foggy, +3.7 mAP50 for BDD100K Day $\rightarrow$ Night, +3.0 mAP50 for BDD100K Clear $\rightarrow$ Rainy, and +6.3 mIoU for Cityscapes $\rightarrow$ ACDC. Besides, Our method adds minimal training memory and no additional inference latency. Code is available at https://github.com/ShenZheng2000/Instance-Warp
CVNov 17, 2021Code
SAPNet: Segmentation-Aware Progressive Network for Perceptual Contrastive DerainingShen Zheng, Changjie Lu, Yuxiong Wu et al.
Deep learning algorithms have recently achieved promising deraining performances on both the natural and synthetic rainy datasets. As an essential low-level pre-processing stage, a deraining network should clear the rain streaks and preserve the fine semantic details. However, most existing methods only consider low-level image restoration. That limits their performances at high-level tasks requiring precise semantic information. To address this issue, in this paper, we present a segmentation-aware progressive network (SAPNet) based upon contrastive learning for single image deraining. We start our method with a lightweight derain network formed with progressive dilated units (PDU). The PDU can significantly expand the receptive field and characterize multi-scale rain streaks without the heavy computation on multi-scale images. A fundamental aspect of this work is an unsupervised background segmentation (UBS) network initialized with ImageNet and Gaussian weights. The UBS can faithfully preserve an image's semantic information and improve the generalization ability to unseen photos. Furthermore, we introduce a perceptual contrastive loss (PCL) and a learned perceptual image similarity loss (LPISL) to regulate model learning. By exploiting the rainy image and groundtruth as the negative and the positive sample in the VGG-16 latent space, we bridge the fine semantic details between the derained image and the groundtruth in a fully constrained manner. Comprehensive experiments on synthetic and real-world rainy images show our model surpasses top-performing methods and aids object detection and semantic segmentation with considerable efficacy. A Pytorch Implementation is available at https://github.com/ShenZheng2000/SAPNet-for-image-deraining.
CVOct 3, 2021Code
Semantic-Guided Zero-Shot Learning for Low-Light Image/Video EnhancementShen Zheng, Gaurav Gupta
Low-light images challenge both human perceptions and computer vision algorithms. It is crucial to make algorithms robust to enlighten low-light images for computational photography and computer vision applications such as real-time detection and segmentation. This paper proposes a semantic-guided zero-shot low-light enhancement network (SGZ) which is trained in the absence of paired images, unpaired datasets, and segmentation annotation. Firstly, we design an enhancement factor extraction network using depthwise separable convolution for an efficient estimate of the pixel-wise light deficiency of an low-light image. Secondly, we propose a recurrent image enhancement network to progressively enhance the low-light image with affordable model size. Finally, we introduce an unsupervised semantic segmentation network for preserving the semantic information during intensive enhancement. Extensive experiments on benchmark datasets and a low-light video demonstrate that our model outperforms the previous state-of-the-art. We further discuss the benefits of the proposed method for low-light detection and segmentation. Code is available at https://github.com/ShenZheng2000/Semantic-Guided-Low-Light-Image-Enhancement
CVDec 26, 2025
High-Fidelity and Long-Duration Human Image Animation with Diffusion TransformerShen Zheng, Jiaran Cai, Yuansheng Guan et al.
Recent progress in diffusion models has significantly advanced the field of human image animation. While existing methods can generate temporally consistent results for short or regular motions, significant challenges remain, particularly in generating long-duration videos. Furthermore, the synthesis of fine-grained facial and hand details remains under-explored, limiting the applicability of current approaches in real-world, high-quality applications. To address these limitations, we propose a diffusion transformer (DiT)-based framework which focuses on generating high-fidelity and long-duration human animation videos. First, we design a set of hybrid implicit guidance signals and a sharpness guidance factor, enabling our framework to additionally incorporate detailed facial and hand features as guidance. Next, we incorporate the time-aware position shift fusion module, modify the input format within the DiT backbone, and refer to this mechanism as the Position Shift Adaptive Module, which enables video generation of arbitrary length. Finally, we introduce a novel data augmentation strategy and a skeleton alignment model to reduce the impact of human shape variations across different identities. Experimental results demonstrate that our method outperforms existing state-of-the-art approaches, achieving superior performance in both high-fidelity and long-duration human image animation.
CVOct 14, 2025
Playmate2: Training-Free Multi-Character Audio-Driven Animation via Diffusion Transformer with Reward FeedbackXingpei Ma, Shenneng Huang, Jiaran Cai et al.
Recent advances in diffusion models have significantly improved audio-driven human video generation, surpassing traditional methods in both quality and controllability. However, existing approaches still face challenges in lip-sync accuracy, temporal coherence for long video generation, and multi-character animation. In this work, we propose a diffusion transformer (DiT)-based framework for generating lifelike talking videos of arbitrary length, and introduce a training-free method for multi-character audio-driven animation. First, we employ a LoRA-based training strategy combined with a position shift inference approach, which enables efficient long video generation while preserving the capabilities of the foundation model. Moreover, we combine partial parameter updates with reward feedback to enhance both lip synchronization and natural body motion. Finally, we propose a training-free approach, Mask Classifier-Free Guidance (Mask-CFG), for multi-character animation, which requires no specialized datasets or model modifications and supports audio-driven animation for three or more characters. Experimental results demonstrate that our method outperforms existing state-of-the-art approaches, achieving high-quality, temporally coherent, and multi-character audio-driven video generation in a simple, efficient, and cost-effective manner.
CVJun 11, 2024
ROADWork: A Dataset and Benchmark for Learning to Recognize, Observe, Analyze and Drive Through Work ZonesAnurag Ghosh, Shen Zheng, Robert Tamburo et al.
Perceiving and autonomously navigating through work zones is a challenging and underexplored problem. Open datasets for this long-tailed scenario are scarce. We propose the ROADWork dataset to learn to recognize, observe, analyze, and drive through work zones. State-of-the-art foundation models fail when applied to work zones. Fine-tuning models on our dataset significantly improves perception and navigation in work zones. With ROADWork dataset, we discover new work zone images with higher precision (+32.5%) at a much higher rate (12.8$\times$) around the world. Open-vocabulary methods fail too, whereas fine-tuned detectors improve performance (+32.2 AP). Vision-Language Models (VLMs) struggle to describe work zones, but fine-tuning substantially improves performance (+36.7 SPICE). Beyond fine-tuning, we show the value of simple techniques. Video label propagation provides additional gains (+2.6 AP) for instance segmentation. While reading work zone signs, composing a detector and text spotter via crop-scaling improves performance +14.2% 1-NED). Composing work zone detections to provide context further reduces hallucinations (+3.9 SPICE) in VLMs. We predict navigational goals and compute drivable paths from work zone videos. Incorporating road work semantics ensures 53.6% goals have angular error (AE) < 0.5 (+9.9 %) and 75.3% pathways have AE < 0.5 (+8.1 %).
CLMay 22, 2023
Quantifying Association Capabilities of Large Language Models and Its Implications on Privacy LeakageHanyin Shao, Jie Huang, Shen Zheng et al.
The advancement of large language models (LLMs) brings notable improvements across various applications, while simultaneously raising concerns about potential private data exposure. One notable capability of LLMs is their ability to form associations between different pieces of information, but this raises concerns when it comes to personally identifiable information (PII). This paper delves into the association capabilities of language models, aiming to uncover the factors that influence their proficiency in associating information. Our study reveals that as models scale up, their capacity to associate entities/information intensifies, particularly when target pairs demonstrate shorter co-occurrence distances or higher co-occurrence frequencies. However, there is a distinct performance gap when associating commonsense knowledge versus PII, with the latter showing lower accuracy. Despite the proportion of accurately predicted PII being relatively small, LLMs still demonstrate the capability to predict specific instances of email addresses and phone numbers when provided with appropriate prompts. These findings underscore the potential risk to PII confidentiality posed by the evolving capabilities of LLMs, especially as they continue to expand in scale and power.
AINov 26, 2021
Asian Giant Hornet Control based on Image Processing and Biological DispersalChangjie Lu, Shen Zheng, Hailu Qiu
The Asian giant hornet (AGH) appeared in Washington State appears to have a potential danger of bioinvasion. Washington State has collected public photos and videos of detected insects for verification and further investigation. In this paper, we analyze AGH using data analysis,statistics, discrete mathematics, and deep learning techniques to process the data to controlAGH spreading.First, we visualize the geographical distribution of insects in Washington State. Then we investigate insect populations to varying months of the year and different days of a month.Third, we employ wavelet analysis to examine the periodic spread of AGH. Fourth, we apply ordinary differential equations to examine AGH numbers at the different natural growthrate and reaction speed and output the potential propagation coefficient. Next, we leverage cellular automaton combined with the potential propagation coefficient to simulate the geographical spread under changing potential propagation. To update the model, we use delayed differential equations to simulate human intervention. We use the time difference between detection time and submission time to determine the unit of time to delay time. After that, we construct a lightweight CNN called SqueezeNet and assess its classification performance. We then relate several non-reference image quality metrics, including NIQE, image gradient, entropy, contrast, and TOPSIS to judge the cause of misclassification. Furthermore, we build a Random Forest classifier to identify positive and negative samples based on image qualities only. We also display the feature importance and conduct an error analysis. Besides, we present sensitivity analysis to verify the robustness of our models. Finally, we show the strengths and weaknesses of our model and derives the conclusions.
LGFeb 12, 2021
Exploiting Spline Models for the Training of Fully Connected Layers in Neural NetworkKanya Mo, Shen Zheng, Xiwei Wang et al.
The fully connected (FC) layer, one of the most fundamental modules in artificial neural networks (ANN), is often considered difficult and inefficient to train due to issues including the risk of overfitting caused by its large amount of parameters. Based on previous work studying ANN from linear spline perspectives, we propose a spline-based approach that eases the difficulty of training FC layers. Given some dataset, we first obtain a continuous piece-wise linear (CPWL) fit through spline methods such as multivariate adaptive regression spline (MARS). Next, we construct an ANN model from the linear spline model and continue to train the ANN model on the dataset using gradient descent optimization algorithms. Our experimental results and theoretical analysis show that our approach reduces the computational cost, accelerates the convergence of FC layers, and significantly increases the interpretability of the resulting model (FC layers) compared with standard ANN training with random parameter initialization followed by gradient descent optimizations.
CVDec 14, 2018
Pyramid Network with Online Hard Example Mining for Accurate Left Atrium SegmentationCheng Bian, Xin Yang, Jianqiang Ma et al.
Accurately segmenting left atrium in MR volume can benefit the ablation procedure of atrial fibrillation. Traditional automated solutions often fail in relieving experts from the labor-intensive manual labeling. In this paper, we propose a deep neural network based solution for automated left atrium segmentation in gadolinium-enhanced MR volumes with promising performance. We firstly argue that, for this volumetric segmentation task, networks in 2D fashion can present great superiorities in time efficiency and segmentation accuracy than networks with 3D fashion. Considering the highly varying shape of atrium and the branchy structure of associated pulmonary veins, we propose to adopt a pyramid module to collect semantic cues in feature maps from multiple scales for fine-grained segmentation. Also, to promote our network in classifying the hard examples, we propose an Online Hard Negative Example Mining strategy to identify voxels in slices with low classification certainties and penalize the wrong predictions on them. Finally, we devise a competitive training scheme to further boost the generalization ability of networks. Extensively verified on 20 testing volumes, our proposed framework achieves an average Dice of 92.83% in segmenting the left atria and pulmonary veins.