Yi Wei

CV
h-index38
49papers
3,626citations
Novelty51%
AI Score59

49 Papers

CVMar 16, 2023Code
SurroundOcc: Multi-Camera 3D Occupancy Prediction for Autonomous Driving

Yi Wei, Linqing Zhao, Wenzhao Zheng et al. · tsinghua

3D scene understanding plays a vital role in vision-based autonomous driving. While most existing methods focus on 3D object detection, they have difficulty describing real-world objects of arbitrary shapes and infinite classes. Towards a more comprehensive perception of a 3D scene, in this paper, we propose a SurroundOcc method to predict the 3D occupancy with multi-camera images. We first extract multi-scale features for each image and adopt spatial 2D-3D attention to lift them to the 3D volume space. Then we apply 3D convolutions to progressively upsample the volume features and impose supervision on multiple levels. To obtain dense occupancy prediction, we design a pipeline to generate dense occupancy ground truth without expansive occupancy annotations. Specifically, we fuse multi-frame LiDAR scans of dynamic objects and static scenes separately. Then we adopt Poisson Reconstruction to fill the holes and voxelize the mesh to get dense occupancy labels. Extensive experiments on nuScenes and SemanticKITTI datasets demonstrate the superiority of our method. Code and dataset are available at https://github.com/weiyithu/SurroundOcc

CVApr 7, 2022
SurroundDepth: Entangling Surrounding Views for Self-Supervised Multi-Camera Depth Estimation

Yi Wei, Linqing Zhao, Wenzhao Zheng et al. · tsinghua

Depth estimation from images serves as the fundamental step of 3D perception for autonomous driving and is an economical alternative to expensive depth sensors like LiDAR. The temporal photometric constraints enables self-supervised depth estimation without labels, further facilitating its application. However, most existing methods predict the depth solely based on each monocular image and ignore the correlations among multiple surrounding cameras, which are typically available for modern self-driving vehicles. In this paper, we propose a SurroundDepth method to incorporate the information from multiple surrounding views to predict depth maps across cameras. Specifically, we employ a joint network to process all the surrounding views and propose a cross-view transformer to effectively fuse the information from multiple views. We apply cross-view self-attention to efficiently enable the global interactions between multi-camera feature maps. Different from self-supervised monocular depth estimation, we are able to predict real-world scales given multi-camera extrinsic matrices. To achieve this goal, we adopt the two-frame structure-from-motion to extract scale-aware pseudo depths to pretrain the models. Further, instead of predicting the ego-motion of each individual camera, we estimate a universal ego-motion of the vehicle and transfer it to each view to achieve multi-view ego-motion consistency. In experiments, our method achieves the state-of-the-art performance on the challenging multi-camera depth estimation datasets DDAD and nuScenes.

CVMar 28, 2022
LiDAR Distillation: Bridging the Beam-Induced Domain Gap for 3D Object Detection

Yi Wei, Zibu Wei, Yongming Rao et al. · tsinghua

In this paper, we propose the LiDAR Distillation to bridge the domain gap induced by different LiDAR beams for 3D object detection. In many real-world applications, the LiDAR points used by mass-produced robots and vehicles usually have fewer beams than that in large-scale public datasets. Moreover, as the LiDARs are upgraded to other product models with different beam amount, it becomes challenging to utilize the labeled data captured by previous versions' high-resolution sensors. Despite the recent progress on domain adaptive 3D detection, most methods struggle to eliminate the beam-induced domain gap. We find that it is essential to align the point cloud density of the source domain with that of the target domain during the training process. Inspired by this discovery, we propose a progressive framework to mitigate the beam-induced domain shift. In each iteration, we first generate low-beam pseudo LiDAR by downsampling the high-beam point clouds. Then the teacher-student framework is employed to distill rich information from the data with more beams. Extensive experiments on Waymo, nuScenes and KITTI datasets with three different LiDAR-based detectors demonstrate the effectiveness of our LiDAR Distillation. Notably, our approach does not increase any additional computation cost for inference.

CVMar 7, 2023
OpenOccupancy: A Large Scale Benchmark for Surrounding Semantic Occupancy Perception

Xiaofeng Wang, Zheng Zhu, Wenbo Xu et al.

Semantic occupancy perception is essential for autonomous driving, as automated vehicles require a fine-grained perception of the 3D urban structures. However, existing relevant benchmarks lack diversity in urban scenes, and they only evaluate front-view predictions. Towards a comprehensive benchmarking of surrounding perception algorithms, we propose OpenOccupancy, which is the first surrounding semantic occupancy perception benchmark. In the OpenOccupancy benchmark, we extend the large-scale nuScenes dataset with dense semantic occupancy annotations. Previous annotations rely on LiDAR points superimposition, where some occupancy labels are missed due to sparse LiDAR channels. To mitigate the problem, we introduce the Augmenting And Purifying (AAP) pipeline to ~2x densify the annotations, where ~4000 human hours are involved in the labeling process. Besides, camera-based, LiDAR-based and multi-modal baselines are established for the OpenOccupancy benchmark. Furthermore, considering the complexity of surrounding occupancy perception lies in the computational burden of high-resolution 3D predictions, we propose the Cascade Occupancy Network (CONet) to refine the coarse prediction, which relatively enhances the performance by ~30% than the baseline. We hope the OpenOccupancy benchmark will boost the development of surrounding occupancy perception algorithms.

ITOct 23, 2022
Fast Beam Alignment via Pure Exploration in Multi-armed Bandits

Yi Wei, Zixin Zhong, Vincent Y. F. Tan

The beam alignment (BA) problem consists in accurately aligning the transmitter and receiver beams to establish a reliable communication link in wireless communication systems. Existing BA methods search the entire beam space to identify the optimal transmit-receive beam pair. This incurs a significant latency when the number of antennas is large. In this work, we develop a bandit-based fast BA algorithm to reduce BA latency for millimeter-wave (mmWave) communications. Our algorithm is named Two-Phase Heteroscedastic Track-and-Stop (2PHT\&S). We first formulate the BA problem as a pure exploration problem in multi-armed bandits in which the objective is to minimize the required number of time steps given a certain fixed confidence level. By taking advantage of the correlation structure among beams that the information from nearby beams is similar and the heteroscedastic property that the variance of the reward of an arm (beam) is related to its mean, the proposed algorithm groups all beams into several beam sets such that the optimal beam set is first selected and the optimal beam is identified in this set after that. Theoretical analysis and simulation results on synthetic and semi-practical channel data demonstrate the clear superiority of the proposed algorithm vis-à-vis other baseline competitors.

ROAug 6, 2022
Smart Explorer: Recognizing Objects in Dense Clutter via Interactive Exploration

Zhenyu Wu, Ziwei Wang, Zibu Wei et al.

Recognizing objects in dense clutter accurately plays an important role to a wide variety of robotic manipulation tasks including grasping, packing, rearranging and many others. However, conventional visual recognition models usually miss objects because of the significant occlusion among instances and causes incorrect prediction due to the visual ambiguity with the high object crowdedness. In this paper, we propose an interactive exploration framework called Smart Explorer for recognizing all objects in dense clutters. Our Smart Explorer physically interacts with the clutter to maximize the recognition performance while minimize the number of motions, where the false positives and negatives can be alleviated effectively with the optimal accuracy-efficiency trade-offs. Specifically, we first collect the multi-view RGB-D images of the clutter and reconstruct the corresponding point cloud. By aggregating the instance segmentation of RGB images across views, we acquire the instance-wise point cloud partition of the clutter through which the existed classes and the number of objects for each class are predicted. The pushing actions for effective physical interaction are generated to sizably reduce the recognition uncertainty that consists of the instance segmentation entropy and multi-view object disagreement. Therefore, the optimal accuracy-efficiency trade-off of object recognition in dense clutter is achieved via iterative instance prediction and physical interaction. Extensive experiments demonstrate that our Smart Explorer acquires promising recognition accuracy with only a few actions, which also outperforms the random pushing by a large margin.

CVNov 13, 2025Code
Image Aesthetic Reasoning via HCM-GRPO: Empowering Compact Model for Superior Performance

Zhiyuan Hu, Zheng Sun, Yi Wei et al.

The performance of image generation has been significantly improved in recent years. However, the study of image screening is rare and its performance with Multimodal Large Language Models (MLLMs) is unsatisfactory due to the lack of data and the weak image aesthetic reasoning ability in MLLMs. In this work, we propose a complete solution to address these problems in terms of data and methodology. For data, we collect a comprehensive image screening dataset with over 128k samples, about 640k images. Each sample consists of an original image, four generated images. The dataset evaluates the image aesthetic reasoning ability under four aspects: appearance deformation, physical shadow, placement layout, and extension rationality. Regarding data annotation, we investigate multiple approaches, including purely manual, fully automated, and answer-driven annotations, to acquire high-quality chains of thought (CoT) data in the most cost-effective manner. Methodologically, we introduce a Hard Cases Mining (HCM) strategy with a Dynamic Proportional Accuracy (DPA) reward into the Group Relative Policy Optimization (GRPO) framework, called HCM-GRPO. This enhanced method demonstrates superior image aesthetic reasoning capabilities compared to the original GRPO. Our experimental results reveal that even state-of-the-art closed-source MLLMs, such as GPT4o and Qwen-VL-Max, exhibit performance akin to random guessing in image aesthetic reasoning. In contrast, by leveraging the HCM-GRPO, we are able to surpass the scores of both large-scale open-source and leading closed-source models with a much smaller model.

OCMar 31
Adaptive Delayed-Update Cyclic Algorithm for Variational Inequalities

Yi Wei, Xufeng Cai, Jelena Diakonikolas

Cyclic block coordinate methods are a fundamental class of first-order algorithms, widely used in practice for their simplicity and strong empirical performance. Yet, their theoretical behavior remains challenging to explain, and setting their step sizes -- beyond classical coordinate descent for minimization -- typically requires careful tuning or line-search machinery. In this work, we develop $\texttt{ADUCA}$ (Adaptive Delayed-Update Cyclic Algorithm), a cyclic algorithm addressing a broad class of Minty variational inequalities with monotone Lipschitz operators. $\texttt{ADUCA}$ is parameter-free: it requires no global or block-wise Lipschitz constants and uses no per-epoch line search, except at initialization. A key feature of the algorithm is using operator information delayed by a full cycle, which makes the algorithm compatible with parallel and distributed implementations, and attractive due to weakened synchronization requirements across blocks. We prove that $\texttt{ADUCA}$ attains (near) optimal global oracle complexity as a function of target error $ε>0,$ scaling with $1/ε$ for monotone operators, or with $\log^2(1/ε)$ for operators that are strongly monotone.

CVJul 9, 2025Code
ADIEE: Automatic Dataset Creation and Scorer for Instruction-Guided Image Editing Evaluation

Sherry X. Chen, Yi Wei, Luowei Zhou et al.

Recent advances in instruction-guided image editing underscore the need for effective automated evaluation. While Vision-Language Models (VLMs) have been explored as judges, open-source models struggle with alignment, and proprietary models lack transparency and cost efficiency. Additionally, no public training datasets exist to fine-tune open-source VLMs, only small benchmarks with diverse evaluation schemes. To address this, we introduce ADIEE, an automated dataset creation approach which is then used to train a scoring model for instruction-guided image editing evaluation. We generate a large-scale dataset with over 100K samples and use it to fine-tune a LLaVA-NeXT-8B model modified to decode a numeric score from a custom token. The resulting scorer outperforms all open-source VLMs and Gemini-Pro 1.5 across all benchmarks, achieving a 0.0696 (+17.24%) gain in score correlation with human ratings on AURORA-Bench, and improving pair-wise comparison accuracy by 4.03% (+7.21%) on GenAI-Bench and 4.75% (+9.35%) on AURORA-Bench, respectively, compared to the state-of-the-art. The scorer can act as a reward model, enabling automated best edit selection and model fine-tuning. Notably, the proposed scorer can boost MagicBrush model's average evaluation score on ImagenHub from 5.90 to 6.43 (+8.98%). Our code and models are available at https://github.com/SherryXTChen/ADIEE.git.

CVJan 19, 2022Code
A Confidence-based Iterative Solver of Depths and Surface Normals for Deep Multi-view Stereo

Wang Zhao, Shaohui Liu, Yi Wei et al.

In this paper, we introduce a deep multi-view stereo (MVS) system that jointly predicts depths, surface normals and per-view confidence maps. The key to our approach is a novel solver that iteratively solves for per-view depth map and normal map by optimizing an energy potential based on the locally planar assumption. Specifically, the algorithm updates depth map by propagating from neighboring pixels with slanted planes, and updates normal map with local probabilistic plane fitting. Both two steps are monitored by a customized confidence map. This solver is not only effective as a post-processing tool for plane-based depth refinement and completion, but also differentiable such that it can be efficiently integrated into deep learning pipelines. Our multi-view stereo system employs multiple optimization steps of the solver over the initial prediction of depths and surface normals. The whole system can be trained end-to-end, decoupling the challenging problem of matching pixels within poorly textured regions from the cost-volume based neural network. Experimental results on ScanNet and RGB-D Scenes V2 demonstrate state-of-the-art performance of the proposed deep MVS system on multi-view depth estimation, with our proposed solver consistently improving the depth quality over both conventional and deep learning based MVS pipelines. Code is available at https://github.com/thuzhaowang/idn-solver.

CVSep 2, 2021Code
NerfingMVS: Guided Optimization of Neural Radiance Fields for Indoor Multi-view Stereo

Yi Wei, Shaohui Liu, Yongming Rao et al.

In this work, we present a new multi-view depth estimation method that utilizes both conventional reconstruction and learning-based priors over the recently proposed neural radiance fields (NeRF). Unlike existing neural network based optimization method that relies on estimated correspondences, our method directly optimizes over implicit volumes, eliminating the challenging step of matching pixels in indoor scenes. The key to our approach is to utilize the learning-based priors to guide the optimization process of NeRF. Our system firstly adapts a monocular depth network over the target scene by finetuning on its sparse SfM+MVS reconstruction from COLMAP. Then, we show that the shape-radiance ambiguity of NeRF still exists in indoor environments and propose to address the issue by employing the adapted depth priors to monitor the sampling process of volume rendering. Finally, a per-pixel confidence map acquired by error computation on the rendered image can be used to further improve the depth quality. Experiments show that our proposed framework significantly outperforms state-of-the-art methods on indoor scenes, with surprising findings presented on the effectiveness of correspondence-based optimization and NeRF-based optimization over the adapted depth priors. In addition, we show that the guided optimization scheme does not sacrifice the original synthesis capability of neural radiance fields, improving the rendering quality on both seen and novel views. Code is available at https://github.com/weiyithu/NerfingMVS.

CVDec 18, 2020Code
SegGroup: Seg-Level Supervision for 3D Instance and Semantic Segmentation

An Tao, Yueqi Duan, Yi Wei et al.

Most existing point cloud instance and semantic segmentation methods rely heavily on strong supervision signals, which require point-level labels for every point in the scene. However, such strong supervision suffers from large annotation costs, arousing the need to study efficient annotating. In this paper, we discover that the locations of instances matter for both instance and semantic 3D scene segmentation. By fully taking advantage of locations, we design a weakly-supervised point cloud segmentation method that only requires clicking on one point per instance to indicate its location for annotation. With over-segmentation for pre-processing, we extend these location annotations into segments as seg-level labels. We further design a segment grouping network (SegGroup) to generate point-level pseudo labels under seg-level labels by hierarchically grouping the unlabeled segments into the relevant nearby labeled segments, so that existing point-level supervised segmentation models can directly consume these pseudo labels for training. Experimental results show that our seg-level supervised method (SegGroup) achieves comparable results with the fully annotated point-level supervised methods. Moreover, it outperforms the recent weakly-supervised methods given a fixed annotation budget. Code is available at https://github.com/AnTao97/SegGroup.

SENov 26, 2015Code
SWIM: Synthesizing What I Mean

Mukund Raghothaman, Yi Wei, Youssef Hamadi

Modern programming frameworks come with large libraries, with diverse applications such as for matching regular expressions, parsing XML files and sending email. Programmers often use search engines such as Google and Bing to learn about existing APIs. In this paper, we describe SWIM, a tool which suggests code snippets given API-related natural language queries such as "generate md5 hash code". We translate user queries into the APIs of interest using clickthrough data from the Bing search engine. Then, based on patterns learned from open-source code repositories, we synthesize idiomatic code describing the use of these APIs. We introduce \emph{structured call sequences} to capture API-usage patterns. Structured call sequences are a generalized form of method call sequences, with if-branches and while-loops to represent conditional and repeated API usage patterns, and are simple to extract and amenable to synthesis. We evaluated SWIM with 30 common C# API-related queries received by Bing. For 70% of the queries, the first suggested snippet was a relevant solution, and a relevant solution was present in the top 10 results for all benchmarked queries. The online portion of the workflow is also very responsive, at an average of 1.5 seconds per snippet.

LGMay 7
AffineLens: Capturing the Continuous Piecewise Affine Functions of Neural Networks

Yi Wei, Xuan Qi, Furao shen et al.

Piecewise affine neural networks (PANNs) provide a principled geometric perspective on neural network expressivity by characterizing the input--output map as a continuous piecewise affine (CPA) function whose complexity is governed by the number, arrangement, and shapes of its affine regions. However, existing interpretability and expressivity analyses often rely on indirect proxies (e.g., activation statistics or theoretical upper bounds) and rarely offer practical, accurate tools for enumerating and visualizing the induced region partition under realistic architectures and bounded input domains. In this work, we present AffineLens, a unified framework for computing the hyperplane arrangements and polyhedral structures underlying PANNs. Given a calibrated (bounded) input polytope, AffineLens identifies the subset of neuron-induced hyperplanes that intersect the domain, enumerates the resulting affine sub-regions in a layer-wise manner, and returns provably non-empty maximal CPA regions together with interior representatives. The framework further provides visualizations of region partitioning and decision boundaries, enabling qualitative inspection alongside quantitative region counts. By exploiting the affine restriction property of CPA networks under fixed activation patterns, AffineLens supports a broad class of modern components, including batch normalization, pooling, residual connections, multilayer perceptrons, and convolutional layers. Finally, we use AffineLens to perform a systematic empirical study of architectural expressivity, comparing networks through region complexity metrics and revealing how design choices influence the geometry of learned functions.

LGMay 7
Region Seeding via Pre-Activation Regularization: A Geometric View from Piecewise Affine Nerual Networks

Yi Wei, Xuan Qi, Furao Shen

Deep networks with continuous piecewise affine activations induce polyhedral partitions of the input space, making the number of realized affine regions a natural measure of expressive capacity and a key determinant of how well the model can approximate nonlinear target functions. In practice, standard training realizes far fewer region refinements in data-visited neighborhoods than the architecture could in principle support, while existing region-count theory is primarily architectural and offers little guidance on how optimization shapes the realized partition near the data. Our theory provides a sufficient condition under which bringing neuron switching surfaces sufficiently close to data points ensures their intersection with local neighborhoods, which in turn implies a strict increase in the local affine-region count, yielding a principled training-time handle for seeding data-relevant partitions early in optimization. Guided by these results, we propose a plug-and-play region-seeding regularizer that encourages early partitioning while allowing task-driven refinement to dominate later in training. Experiments show that the regularizer increases the number of realized affine regions via exact enumeration and improves overall performance on toy datasets, while also improving early-stage accuracy and achieving comparable (or slightly improved) final accuracy on ImageNet-1k for classical models.

LGMay 6
Training-Time Batch Normalization Reshapes Local Partition Geometry in Piecewise-Affine Networks

Xuan Qi, Yi Wei, Fanqi Yu et al.

Batch normalization (BN) is central to modern deep networks, but its effect on the realized function during training remains less understood than its optimization benefits. We study training-time BN in continuous piecewise-affine (CPA) networks through the geometry of switching hyperplanes and the induced affine-region partition. Conditioned on a mini-batch, we show that BN defines for each neuron a reference hyperplane through the batch centroid, and that breakpoint-switching hyperplanes are parallel translates whose offsets are expressed in batch-standardized coordinates and are independent of the raw bias. This yields an exact criterion for when a switching hyperplane intersects a local $\ell_\infty$ window and motivates a local region-density functional based on exact affine-region counts. Under explicit sufficient conditions, we show that BN increases expected local partition refinement in ReLU and more general piecewise-affine networks, and that this mechanism transfers locally through depth inside parent affine regions where the upstream representation map is an affine embedding. These results provide a function-level geometric account of training-time BN as a batch-conditional recentering mechanism near the data.

IRMar 24
KARMA: Knowledge-Action Regularized Multimodal Alignment for Personalized Search at Taobao

Zhi Sun, Wenming Zhang, Yi Wei et al.

Large Language Models (LLMs) are equipped with profound semantic knowledge, making them a natural choice for injecting semantic generalization into personalized search systems. However, in practice we find that directly fine-tuning LLMs on industrial personalized tasks (e.g. next item prediction) often yields suboptimal results. We attribute this bottleneck to a critical Knowledge--Action Gap: the inherent conflict between preserving pre-trained semantic knowledge and aligning with specific personalized actions by discriminative objectives. Empirically, action-only training objectives induce Semantic Collapse, such as attention ``sinks''. This degradation severely cripples the LLM's generalization, failing to bring improvements to personalized search systems. We propose KARMA (Knowledge--Action Regularized Multimodal Alignment), a unified framework that treats semantic reconstruction as a train-only regularizer. KARMA optimizes a next-interest embedding for retrieval (Action) while enforcing semantic decodability (Knowledge) through two complementary objectives: (i) history-conditioned semantic generation, which anchors optimization to the LLM's native next-token distribution, and (ii) embedding-conditioned semantic reconstruction, which constrains the interest embedding to remain semantically recoverable. On Taobao search system, KARMA mitigates semantic collapse (attention-sink analysis) and improves both action metrics and semantic fidelity. In ablations, semantic decodability yields up to +22.5 HR@200. With KARMA, we achieve +0.25 CTR AUC in ranking, +1.86 HR in pre-ranking and +2.51 HR in recalling. Deployed online with low inference overhead at ranking stage, KARMA drives +0.5% increase in Item Click.

CVDec 11, 2023
Sherpa3D: Boosting High-Fidelity Text-to-3D Generation via Coarse 3D Prior

Fangfu Liu, Diankun Wu, Yi Wei et al.

Recently, 3D content creation from text prompts has demonstrated remarkable progress by utilizing 2D and 3D diffusion models. While 3D diffusion models ensure great multi-view consistency, their ability to generate high-quality and diverse 3D assets is hindered by the limited 3D data. In contrast, 2D diffusion models find a distillation approach that achieves excellent generalization and rich details without any 3D data. However, 2D lifting methods suffer from inherent view-agnostic ambiguity thereby leading to serious multi-face Janus issues, where text prompts fail to provide sufficient guidance to learn coherent 3D results. Instead of retraining a costly viewpoint-aware model, we study how to fully exploit easily accessible coarse 3D knowledge to enhance the prompts and guide 2D lifting optimization for refinement. In this paper, we propose Sherpa3D, a new text-to-3D framework that achieves high-fidelity, generalizability, and geometric consistency simultaneously. Specifically, we design a pair of guiding strategies derived from the coarse 3D prior generated by the 3D diffusion model: a structural guidance for geometric fidelity and a semantic guidance for 3D coherence. Employing the two types of guidance, the 2D diffusion model enriches the 3D content with diversified and high-quality results. Extensive experiments show the superiority of our Sherpa3D over the state-of-the-art text-to-3D methods in terms of quality and 3D consistency.

LGOct 28, 2023
The Evolution of the Interplay Between Input Distributions and Linear Regions in Networks

Xuan Qi, Yi Wei

It is commonly recognized that the expressiveness of deep neural networks is contingent upon a range of factors, encompassing their depth, width, and other relevant considerations. Currently, the practical performance of the majority of deep neural networks remains uncertain. For ReLU (Rectified Linear Unit) networks with piecewise linear activations, the number of linear convex regions serves as a natural metric to gauge the network's expressivity. In this paper, we count the number of linear convex regions in deep neural networks based on ReLU. In particular, we prove that for any one-dimensional input, there exists a minimum threshold for the number of neurons required to express it. We also empirically observe that for the same network, intricate inputs hinder its capacity to express linear regions. Furthermore, we unveil the iterative refinement process of decision boundaries in ReLU networks during training. We aspire for our research to serve as an inspiration for network optimization endeavors and aids in the exploration and analysis of the behaviors exhibited by deep networks.

CVDec 14, 2023
OccNeRF: Advancing 3D Occupancy Prediction in LiDAR-Free Environments

Chubin Zhang, Juncheng Yan, Yi Wei et al.

Occupancy prediction reconstructs 3D structures of surrounding environments. It provides detailed information for autonomous driving planning and navigation. However, most existing methods heavily rely on the LiDAR point clouds to generate occupancy ground truth, which is not available in the vision-based system. In this paper, we propose an OccNeRF method for training occupancy networks without 3D supervision. Different from previous works which consider a bounded scene, we parameterize the reconstructed occupancy fields and reorganize the sampling strategy to align with the cameras' infinite perceptive range. The neural rendering is adopted to convert occupancy fields to multi-camera depth maps, supervised by multi-frame photometric consistency. Moreover, for semantic occupancy prediction, we design several strategies to polish the prompts and filter the outputs of a pretrained open-vocabulary 2D segmentation model. Extensive experiments for both self-supervised depth estimation and 3D occupancy prediction tasks on nuScenes and SemanticKITTI datasets demonstrate the effectiveness of our method.

LGDec 9, 2024
Balancing Efficiency and Effectiveness: An LLM-Infused Approach for Optimized CTR Prediction

Guoxiao Zhang, Yi Wei, Yadong Zhang et al.

Click-Through Rate (CTR) prediction is essential in online advertising, where semantic information plays a pivotal role in shaping user decisions and enhancing CTR effectiveness. Capturing and modeling deep semantic information, such as a user's preference for "Häagen-Dazs' HEAVEN strawberry light ice cream" due to its health-conscious and premium attributes, is challenging. Traditional semantic modeling often overlooks these intricate details at the user and item levels. To bridge this gap, we introduce a novel approach that models deep semantic information end-to-end, leveraging the comprehensive world knowledge capabilities of Large Language Models (LLMs). Our proposed LLM-infused CTR prediction framework(Multi-level Deep Semantic Information Infused CTR model via Distillation, MSD) is designed to uncover deep semantic insights by utilizing LLMs to extract and distill critical information into a smaller, more efficient model, enabling seamless end-to-end training and inference. Importantly, our framework is carefully designed to balance efficiency and effectiveness, ensuring that the model not only achieves high performance but also operates with optimal resource utilization. Online A/B tests conducted on the Meituan sponsored-search system demonstrate that our method significantly outperforms baseline models in terms of Cost Per Mile (CPM) and CTR, validating its effectiveness, scalability, and balanced approach in real-world applications.

CVMay 29, 2025
Image Aesthetic Reasoning: A New Benchmark for Medical Image Screening with MLLMs

Zheng Sun, Yi Wei, Long Yu

Multimodal Large Language Models (MLLMs) are of great application across many domains, such as multimodal understanding and generation. With the development of diffusion models (DM) and unified MLLMs, the performance of image generation has been significantly improved, however, the study of image screening is rare and its performance with MLLMs is unsatisfactory due to the lack of data and the week image aesthetic reasoning ability in MLLMs. In this work, we propose a complete solution to address these problems in terms of data and methodology. For data, we collect a comprehensive medical image screening dataset with 1500+ samples, each sample consists of a medical image, four generated images, and a multiple-choice answer. The dataset evaluates the aesthetic reasoning ability under four aspects: \textit{(1) Appearance Deformation, (2) Principles of Physical Lighting and Shadow, (3) Placement Layout, (4) Extension Rationality}. For methodology, we utilize long chains of thought (CoT) and Group Relative Policy Optimization with Dynamic Proportional Accuracy reward, called DPA-GRPO, to enhance the image aesthetic reasoning ability of MLLMs. Our experimental results reveal that even state-of-the-art closed-source MLLMs, such as GPT-4o and Qwen-VL-Max, exhibit performance akin to random guessing in image aesthetic reasoning. In contrast, by leveraging the reinforcement learning approach, we are able to surpass the score of both large-scale models and leading closed-source models using a much smaller model. We hope our attempt on medical image screening will serve as a regular configuration in image aesthetic reasoning in the future.

LGOct 31, 2024
RAM: Replace Attention with MLP for Efficient Multivariate Time Series Forecasting

Suhan Guo, Jiahong Deng, Yi Wei et al.

Attention-based architectures have become ubiquitous in time series forecasting tasks, including spatio-temporal (STF) and long-term time series forecasting (LTSF). Yet, our understanding of the reasons for their effectiveness remains limited. In this work, we propose a novel pruning strategy, $\textbf{R}$eplace $\textbf{A}$ttention with $\textbf{M}$LP (RAM), that approximates the attention mechanism using only feedforward layers, residual connections, and layer normalization for temporal and/or spatial modeling in multivariate time series forecasting. Specifically, the Q, K, and V projections, the attention score calculation, the dot-product between the attention score and the V, and the final projection can be removed from the attention-based networks without significantly degrading the performance, so that the given network remains the top-tier compared to other SOTA methods. RAM achieves a $62.579\%$ reduction in FLOPs for spatio-temporal models with less than $2.5\%$ performance drop, and a $42.233\%$ FLOPs reduction for LTSF models with less than $2\%$ performance drop.

CVOct 18, 2025
DiffusionX: Efficient Edge-Cloud Collaborative Image Generation with Multi-Round Prompt Evolution

Yi Wei, Shunpu Tang, Liang Zhao et al.

Recent advances in diffusion models have driven remarkable progress in image generation. However, the generation process remains computationally intensive, and users often need to iteratively refine prompts to achieve the desired results, further increasing latency and placing a heavy burden on cloud resources. To address this challenge, we propose DiffusionX, a cloud-edge collaborative framework for efficient multi-round, prompt-based generation. In this system, a lightweight on-device diffusion model interacts with users by rapidly producing preview images, while a high-capacity cloud model performs final refinements after the prompt is finalized. We further introduce a noise level predictor that dynamically balances the computation load, optimizing the trade-off between latency and cloud workload. Experiments show that DiffusionX reduces average generation time by 15.8% compared with Stable Diffusion v1.5, while maintaining comparable image quality. Moreover, it is only 0.9% slower than Tiny-SD with significantly improved image quality, thereby demonstrating efficiency and scalability with minimal overhead.

CVJun 11, 2025
Vision Generalist Model: A Survey

Ziyi Wang, Yongming Rao, Shuofeng Sun et al. · tsinghua

Recently, we have witnessed the great success of the generalist model in natural language processing. The generalist model is a general framework trained with massive data and is able to process various downstream tasks simultaneously. Encouraged by their impressive performance, an increasing number of researchers are venturing into the realm of applying these models to computer vision tasks. However, the inputs and outputs of vision tasks are more diverse, and it is difficult to summarize them as a unified representation. In this paper, we provide a comprehensive overview of the vision generalist models, delving into their characteristics and capabilities within the field. First, we review the background, including the datasets, tasks, and benchmarks. Then, we dig into the design of frameworks that have been proposed in existing research, while also introducing the techniques employed to enhance their performance. To better help the researchers comprehend the area, we take a brief excursion into related domains, shedding light on their interconnections and potential synergies. To conclude, we provide some real-world application scenarios, undertake a thorough examination of the persistent challenges, and offer insights into possible directions for future research endeavors.

CVMay 13, 2025
HMPNet: A Feature Aggregation Architecture for Maritime Object Detection from a Shipborne Perspective

Yu Zhang, Fengyuan Liu, Juan Lyu et al.

In the realm of intelligent maritime navigation, object detection from a shipborne perspective is paramount. Despite the criticality, the paucity of maritime-specific data impedes the deployment of sophisticated visual perception techniques, akin to those utilized in autonomous vehicular systems, within the maritime context. To bridge this gap, we introduce Navigation12, a novel dataset annotated for 12 object categories under diverse maritime environments and weather conditions. Based upon this dataset, we propose HMPNet, a lightweight architecture tailored for shipborne object detection. HMPNet incorporates a hierarchical dynamic modulation backbone to bolster feature aggregation and expression, complemented by a matrix cascading poly-scale neck and a polymerization weight sharing detector, facilitating efficient multi-scale feature aggregation. Empirical evaluations indicate that HMPNet surpasses current state-of-the-art methods in terms of both accuracy and computational efficiency, realizing a 3.3% improvement in mean Average Precision over YOLOv11n, the prevailing model, and reducing parameters by 23%.

CVJun 21, 2024
GeoLRM: Geometry-Aware Large Reconstruction Model for High-Quality 3D Gaussian Generation

Chubin Zhang, Hongliang Song, Yi Wei et al.

In this work, we introduce the Geometry-Aware Large Reconstruction Model (GeoLRM), an approach which can predict high-quality assets with 512k Gaussians and 21 input images in only 11 GB GPU memory. Previous works neglect the inherent sparsity of 3D structure and do not utilize explicit geometric relationships between 3D and 2D images. This limits these methods to a low-resolution representation and makes it difficult to scale up to the dense views for better quality. GeoLRM tackles these issues by incorporating a novel 3D-aware transformer structure that directly processes 3D points and uses deformable cross-attention mechanisms to effectively integrate image features into 3D representations. We implement this solution through a two-stage pipeline: initially, a lightweight proposal network generates a sparse set of 3D anchor points from the posed image inputs; subsequently, a specialized reconstruction transformer refines the geometry and retrieves textural details. Extensive experimental results demonstrate that GeoLRM significantly outperforms existing models, especially for dense view inputs. We also demonstrate the practical applicability of our model with 3D generation tasks, showcasing its versatility and potential for broader adoption in real-world applications. The project page: https://linshan-bin.github.io/GeoLRM/.

LGMay 25, 2023
Stecformer: Spatio-temporal Encoding Cascaded Transformer for Multivariate Long-term Time Series Forecasting

Zheng Sun, Yi Wei, Wenxiao Jia et al.

Multivariate long-term time series forecasting is of great application across many domains, such as energy consumption and weather forecasting. With the development of transformer-based methods, the performance of multivariate long-term time series forecasting has been significantly improved, however, the study of spatial features extracting in transformer-based model is rare and the consistency of different prediction periods is unsatisfactory due to the large span. In this work, we propose a complete solution to address these problems in terms of feature extraction and target prediction. For extraction, we design an efficient spatio-temporal encoding extractor including a semi-adaptive graph to acquire sufficient spatio-temporal information. For prediction, we propose a Cascaded Decoding Predictor (CDP) to strengthen the correlation between different intervals, which can also be utilized as a generic component to improve the performance of transformer-based methods. The proposed method, termed as Spatio-temporal Encoding Cascaded Transformer (Stecformer), achieving a notable gap over the baseline model and is comparable with the state-of-the-art performance of transformer-based methods on five benchmark datasets. We hope our attempt will serve as a regular configuration in multivariate long-term time series forecasting in the future.

IRDec 27, 2021
Mind the Gap: Cross-Lingual Information Retrieval with Hierarchical Knowledge Enhancement

Fuwei Zhang, Zhao Zhang, Xiang Ao et al.

Cross-Lingual Information Retrieval (CLIR) aims to rank the documents written in a language different from the user's query. The intrinsic gap between different languages is an essential challenge for CLIR. In this paper, we introduce the multilingual knowledge graph (KG) to the CLIR task due to the sufficient information of entities in multiple languages. It is regarded as a "silver bullet" to simultaneously perform explicit alignment between queries and documents and also broaden the representations of queries. And we propose a model named CLIR with hierarchical knowledge enhancement (HIKE) for our task. The proposed model encodes the textual information in queries, documents and the KG with multilingual BERT, and incorporates the KG information in the query-document matching process with a hierarchical information fusion mechanism. Particularly, HIKE first integrates the entities and their neighborhood in KG into query representations with a knowledge-level fusion, then combines the knowledge from both source and target languages to further mitigate the linguistic gap with a language-level fusion. Finally, experimental results demonstrate that HIKE achieves substantial improvements over state-of-the-art competitors.

CVAug 17, 2021
RandomRooms: Unsupervised Pre-training from Synthetic Shapes and Randomized Layouts for 3D Object Detection

Yongming Rao, Benlin Liu, Yi Wei et al.

3D point cloud understanding has made great progress in recent years. However, one major bottleneck is the scarcity of annotated real datasets, especially compared to 2D object detection tasks, since a large amount of labor is involved in annotating the real scans of a scene. A promising solution to this problem is to make better use of the synthetic dataset, which consists of CAD object models, to boost the learning on real datasets. This can be achieved by the pre-training and fine-tuning procedure. However, recent work on 3D pre-training exhibits failure when transfer features learned on synthetic objects to other real-world applications. In this work, we put forward a new method called RandomRooms to accomplish this objective. In particular, we propose to generate random layouts of a scene by making use of the objects in the synthetic CAD dataset and learn the 3D scene representation by applying object-level contrastive learning on two random scenes generated from the same set of synthetic objects. The model pre-trained in this way can serve as a better initialization when later fine-tuning on the 3D object detection task. Empirically, we show consistent improvement in downstream 3D detection tasks on several base models, especially when less training data are used, which strongly demonstrates the effectiveness and generalization of our method. Benefiting from the rich semantic knowledge and diverse objects from synthetic data, our method establishes the new state-of-the-art on widely-used 3D detection benchmarks ScanNetV2 and SUN RGB-D. We expect our attempt to provide a new perspective for bridging object and scene-level 3D understanding.

CVMay 17, 2021
FGR: Frustum-Aware Geometric Reasoning for Weakly Supervised 3D Vehicle Detection

Yi Wei, Shang Su, Jiwen Lu et al.

In this paper, we investigate the problem of weakly supervised 3D vehicle detection. Conventional methods for 3D object detection need vast amounts of manually labelled 3D data as supervision signals. However, annotating large datasets requires huge human efforts, especially for 3D area. To tackle this problem, we propose frustum-aware geometric reasoning (FGR) to detect vehicles in point clouds without any 3D annotations. Our method consists of two stages: coarse 3D segmentation and 3D bounding box estimation. For the first stage, a context-aware adaptive region growing algorithm is designed to segment objects based on 2D bounding boxes. Leveraging predicted segmentation masks, we develop an anti-noise approach to estimate 3D bounding boxes in the second stage. Finally 3D pseudo labels generated by our method are utilized to train a 3D detector. Independent of any 3D groundtruth, FGR reaches comparable performance with fully supervised methods on the KITTI dataset. The findings indicate that it is able to accurately detect objects in 3D space with only 2D bounding boxes and sparse point clouds.

CVMar 1, 2021
DST: Data Selection and joint Training for Learning with Noisy Labels

Yi Wei, Xue Mei, Xin Liu et al.

Training a deep neural network heavily relies on a large amount of training data with accurate annotations. To alleviate this problem, various methods have been proposed to annotate the data automatically. However, automatically generating annotations will inevitably yields noisy labels. In this paper, we propose a Data Selection and joint Training (DST) method to automatically select training samples with accurate annotations. Specifically, DST fits a mixture model according to the original annotation as well as the predicted label for each training sample, and the mixture model is utilized to dynamically divide the training dataset into a correctly labeled dataset, a correctly predicted set and a wrong dataset. Then, DST is trained with these datasets in a supervised manner. Due to confirmation bias problem, we train the two networks alternately, and each network is tasked to establish the data division to teach another network. For each iteration, the correctly labeled and predicted labels are reweighted respectively by the probabilities from the mixture model, and a uniform distribution is used to generate the probabilities of the wrong samples. Experiments on CIFAR-10, CIFAR-100 and Clothing1M demonstrate that DST is the comparable or superior to the state-of-the-art methods.

LGFeb 12, 2021
Multiplex Bipartite Network Embedding using Dual Hypergraph Convolutional Networks

Hansheng Xue, Luwei Yang, Vaibhav Rajan et al.

A bipartite network is a graph structure where nodes are from two distinct domains and only inter-domain interactions exist as edges. A large number of network embedding methods exist to learn vectorial node representations from general graphs with both homogeneous and heterogeneous node and edge types, including some that can specifically model the distinct properties of bipartite networks. However, these methods are inadequate to model multiplex bipartite networks (e.g., in e-commerce), that have multiple types of interactions (e.g., click, inquiry, and buy) and node attributes. Most real-world multiplex bipartite networks are also sparse and have imbalanced node distributions that are challenging to model. In this paper, we develop an unsupervised Dual HyperGraph Convolutional Network (DualHGCN) model that scalably transforms the multiplex bipartite network into two sets of homogeneous hypergraphs and uses spectral hypergraph convolutional operators, along with intra- and inter-message passing strategies to promote information exchange within and across domains, to learn effective node embedding. We benchmark DualHGCN using four real-world datasets on link prediction and node classification tasks. Our extensive experiments demonstrate that DualHGCN significantly outperforms state-of-the-art methods, and is robust to varying sparsity levels and imbalanced node distributions.

IRJan 22, 2021
Network Clustering for Multi-task Learning

Dehong Gao, Wenjing Yang, Huiling Zhou et al.

The Multi-Task Learning (MTL) technique has been widely studied by word-wide researchers. The majority of current MTL studies adopt the hard parameter sharing structure, where hard layers tend to learn general representations over all tasks and specific layers are prone to learn specific representations for each task. Since the specific layers directly follow the hard layers, the MTL model needs to estimate this direct change (from general to specific) as well. To alleviate this problem, we introduce the novel cluster layer, which groups tasks into clusters during training procedures. In a cluster layer, the tasks in the same cluster are further required to share the same network. By this way, the cluster layer produces the general presentation for the same cluster, while produces relatively specific presentations for different clusters. As transitions the cluster layers are used between the hard layers and the specific layers. The MTL model thus learns general representations to specific representations gradually. We evaluate our model with MTL document classification and the results demonstrate the cluster layer is quite efficient in MTL.

STJan 10, 2021
Absolute Value Constraint: The Reason for Invalid Performance Evaluation Results of Neural Network Models for Stock Price Prediction

Yi Wei

Neural networks for stock price prediction(NNSPP) have been popular for decades. However, most of its study results remain in the research paper and cannot truly play a role in the securities market. One of the main reasons leading to this situation is that the prediction error(PE) based evaluation results have statistical flaws. Its prediction results cannot represent the most critical financial direction attributes. So it cannot provide investors with convincing, interpretable, and consistent model performance evaluation results for practical applications in the securities market. To illustrate, we have used data selected from 20 stock datasets over six years from the Shanghai and Shenzhen stock market in China, and 20 stock datasets from NASDAQ and NYSE in the USA. We implement six shallow and deep neural networks to predict stock prices and use four prediction error measures for evaluation. The results show that the prediction error value only partially reflects the model accuracy of the stock price prediction, and cannot reflect the change in the direction of the model predicted stock price. This characteristic determines that PE is not suitable as an evaluation indicator of NNSPP. Otherwise, it will bring huge potential risks to investors. Therefore, this paper establishes an experiment platform to confirm that the PE method is not suitable for the NNSPP evaluation, and provides a theoretical basis for the necessity of creating a new NNSPP evaluation method in the future.

CVDec 2, 2020
PV-RAFT: Point-Voxel Correlation Fields for Scene Flow Estimation of Point Clouds

Yi Wei, Ziyi Wang, Yongming Rao et al.

In this paper, we propose a Point-Voxel Recurrent All-Pairs Field Transforms (PV-RAFT) method to estimate scene flow from point clouds. Since point clouds are irregular and unordered, it is challenging to efficiently extract features from all-pairs fields in the 3D space, where all-pairs correlations play important roles in scene flow estimation. To tackle this problem, we present point-voxel correlation fields, which capture both local and long-range dependencies of point pairs. To capture point-based correlations, we adopt the K-Nearest Neighbors search that preserves fine-grained information in the local region. By voxelizing point clouds in a multi-scale manner, we construct pyramid correlation voxels to model long-range correspondences. Integrating these two types of correlations, our PV-RAFT makes use of all-pairs relations to handle both small and large displacements. We evaluate the proposed method on the FlyingThings3D and KITTI Scene Flow 2015 datasets. Experimental results show that PV-RAFT outperforms state-of-the-art methods by remarkable margins.

CVOct 3, 2020
MagGAN: High-Resolution Face Attribute Editing with Mask-Guided Generative Adversarial Network

Yi Wei, Zhe Gan, Wenbo Li et al.

We present Mask-guided Generative Adversarial Network (MagGAN) for high-resolution face attribute editing, in which semantic facial masks from a pre-trained face parser are used to guide the fine-grained image editing process. With the introduction of a mask-guided reconstruction loss, MagGAN learns to only edit the facial parts that are relevant to the desired attribute changes, while preserving the attribute-irrelevant regions (e.g., hat, scarf for modification `To Bald'). Further, a novel mask-guided conditioning strategy is introduced to incorporate the influence region of each attribute change into the generator. In addition, a multi-level patch-wise discriminator structure is proposed to scale our model for high-resolution ($1024 \times 1024$) face editing. Experiments on the CelebA benchmark show that the proposed method significantly outperforms prior state-of-the-art approaches in terms of both image quality and editing performance.

SPJun 11, 2020
A PDD Decoder for Binary Linear Codes With Neural Check Polytope Projection

Yi Wei, Ming-Min Zhao, Min-Jian Zhao et al.

Linear Programming (LP) is an important decoding technique for binary linear codes. However, the advantages of LP decoding, such as low error floor and strong theoretical guarantee, etc., come at the cost of high computational complexity and poor performance at the low signal-to-noise ratio (SNR) region. In this letter, we adopt the penalty dual decomposition (PDD) framework and propose a PDD algorithm to address the fundamental polytope based maximum likelihood (ML) decoding problem. Furthermore, we propose to integrate machine learning techniques into the most time-consuming part of the PDD decoding algorithm, i.e., check polytope projection (CPP). Inspired by the fact that a multi-layer perception (MLP) can theoretically approximate any nonlinear mapping function, we present a specially designed neural CPP (NCPP) algorithm to decrease the decoding latency. Simulation results demonstrate the effectiveness of the proposed algorithms.

IRMay 20, 2020
FashionBERT: Text and Image Matching with Adaptive Loss for Cross-modal Retrieval

Dehong Gao, Linbo Jin, Ben Chen et al.

In this paper, we address the text and image matching in cross-modal retrieval of the fashion industry. Different from the matching in the general domain, the fashion matching is required to pay much more attention to the fine-grained information in the fashion images and texts. Pioneer approaches detect the region of interests (i.e., RoIs) from images and use the RoI embeddings as image representations. In general, RoIs tend to represent the "object-level" information in the fashion images, while fashion texts are prone to describe more detailed information, e.g. styles, attributes. RoIs are thus not fine-grained enough for fashion text and image matching. To this end, we propose FashionBERT, which leverages patches as image features. With the pre-trained BERT model as the backbone network, FashionBERT learns high level representations of texts and images. Meanwhile, we propose an adaptive loss to trade off multitask learning in the FashionBERT modeling. Two tasks (i.e., text and image matching and cross-modal retrieval) are incorporated to evaluate FashionBERT. On the public dataset, experiments demonstrate FashionBERT achieves significant improvements in performances than the baseline and state-of-the-art approaches. In practice, FashionBERT is applied in a concrete cross-modal retrieval application. We provide the detailed matching performance and inference efficiency analysis.

IRMay 14, 2020
Deep Hierarchical Classification for Category Prediction in E-commerce System

Dehong Gao, Wenjing Yang, Huiling Zhou et al.

In e-commerce system, category prediction is to automatically predict categories of given texts. Different from traditional classification where there are no relations between classes, category prediction is reckoned as a standard hierarchical classification problem since categories are usually organized as a hierarchical tree. In this paper, we address hierarchical category prediction. We propose a Deep Hierarchical Classification framework, which incorporates the multi-scale hierarchical information in neural networks and introduces a representation sharing strategy according to the category tree. We also define a novel combined loss function to punish hierarchical prediction losses. The evaluation shows that the proposed approach outperforms existing approaches in accuracy.

SIApr 1, 2020
Modeling Dynamic Heterogeneous Network for Link Prediction using Hierarchical Attention with Temporal RNN

Hansheng Xue, Luwei Yang, Wen Jiang et al.

Network embedding aims to learn low-dimensional representations of nodes while capturing structure information of networks. It has achieved great success on many tasks of network analysis such as link prediction and node classification. Most of existing network embedding algorithms focus on how to learn static homogeneous networks effectively. However, networks in the real world are more complex, e.g., networks may consist of several types of nodes and edges (called heterogeneous information) and may vary over time in terms of dynamic nodes and edges (called evolutionary patterns). Limited work has been done for network embedding of dynamic heterogeneous networks as it is challenging to learn both evolutionary and heterogeneous information simultaneously. In this paper, we propose a novel dynamic heterogeneous network embedding method, termed as DyHATR, which uses hierarchical attention to learn heterogeneous information and incorporates recurrent neural networks with temporal attention to capture evolutionary patterns. We benchmark our method on four real-world datasets for the task of link prediction. Experimental results show that DyHATR significantly outperforms several state-of-the-art baselines.

ITFeb 14, 2020
ADMM-based Decoder for Binary Linear Codes Aided by Deep Learning

Yi Wei, Ming-Min Zhao, Min-Jian Zhao et al.

Inspired by the recent advances in deep learning (DL), this work presents a deep neural network aided decoding algorithm for binary linear codes. Based on the concept of deep unfolding, we design a decoding network by unfolding the alternating direction method of multipliers (ADMM)-penalized decoder. In addition, we propose two improved versions of the proposed network. The first one transforms the penalty parameter into a set of iteration-dependent ones, and the second one adopts a specially designed penalty function, which is based on a piecewise linear function with adjustable slopes. Numerical results show that the resulting DL-aided decoders outperform the original ADMM-penalized decoder for various low density parity check (LDPC) codes with similar computational complexity.

SPJun 10, 2019
Learned Conjugate Gradient Descent Network for Massive MIMO Detection

Yi Wei, Ming-Min Zhao, Mingyi Hong et al.

In this work, we consider the use of model-driven deep learning techniques for massive multiple-input multiple-output (MIMO) detection. Compared with conventional MIMO systems, massive MIMO promises improved spectral efficiency, coverage and range. Unfortunately, these benefits are coming at the cost of significantly increased computational complexity. To reduce the complexity of signal detection and guarantee the performance, we present a learned conjugate gradient descent network (LcgNet), which is constructed by unfolding the iterative conjugate gradient descent (CG) detector. In the proposed network, instead of calculating the exact values of the scalar step-sizes, we explicitly learn their universal values. Also, we can enhance the proposed network by augmenting the dimensions of these step-sizes. Furthermore, in order to reduce the memory costs, a novel quantized LcgNet is proposed, where a low-resolution nonuniform quantizer is integrated into the LcgNet to smartly quantize the aforementioned step-sizes. The quantizer is based on a specially designed soft staircase function with learnable parameters to adjust its shape. Meanwhile, due to fact that the number of learnable parameters is limited, the proposed networks are easy and fast to train. Numerical results demonstrate that the proposed network can achieve promising performance with much lower complexity.

CVApr 14, 2019
Conditional Single-view Shape Generation for Multi-view Stereo Reconstruction

Yi Wei, Shaohui Liu, Wang Zhao et al.

In this paper, we present a new perspective towards image-based shape generation. Most existing deep learning based shape reconstruction methods employ a single-view deterministic model which is sometimes insufficient to determine a single groundtruth shape because the back part is occluded. In this work, we first introduce a conditional generative network to model the uncertainty for single-view reconstruction. Then, we formulate the task of multi-view reconstruction as taking the intersection of the predicted shape spaces on each single image. We design new differentiable guidance including the front constraint, the diversity constraint, and the consistency loss to enable effective single-view conditional generation and multi-view synthesis. Experimental results and ablation studies show that our proposed approach outperforms state-of-the-art methods on 3D reconstruction test error and demonstrate its generalization ability on real world data.

CVMay 6, 2018
Quantization Mimic: Towards Very Tiny CNN for Object Detection

Yi Wei, Xinyu Pan, Hongwei Qin et al.

In this paper, we propose a simple and general framework for training very tiny CNNs for object detection. Due to limited representation ability, it is challenging to train very tiny networks for complicated tasks like detection. To the best of our knowledge, our method, called Quantization Mimic, is the first one focusing on very tiny networks. We utilize two types of acceleration methods: mimic and quantization. Mimic improves the performance of a student network by transfering knowledge from a teacher network. Quantization converts a full-precision network to a quantized one without large degradation of performance. If the teacher network is quantized, the search scope of the student network will be smaller. Using this feature of the quantization, we propose Quantization Mimic. It first quantizes the large network, then mimic a quantized small network. The quantization operation can help student network to better match the feature maps from teacher network. To evaluate our approach, we carry out experiments on various popular CNNs including VGG and Resnet, as well as different detection frameworks including Faster R-CNN and R-FCN. Experiments on Pascal VOC and WIDER FACE verify that our Quantization Mimic algorithm can be applied on various settings and outperforms state-of-the-art model acceleration methods given limited computing resouces.

CVApr 26, 2018
Two-Stream Binocular Network: Accurate Near Field Finger Detection Based On Binocular Images

Yi Wei, Guijin Wang, Cairong Zhang et al.

Fingertip detection plays an important role in human computer interaction. Previous works transform binocular images into depth images. Then depth-based hand pose estimation methods are used to predict 3D positions of fingertips. Different from previous works, we propose a new framework, named Two-Stream Binocular Network (TSBnet) to detect fingertips from binocular images directly. TSBnet first shares convolutional layers for low level features of right and left images. Then it extracts high level features in two-stream convolutional networks separately. Further, we add a new layer: binocular distance measurement layer to improve performance of our model. To verify our scheme, we build a binocular hand image dataset, containing about 117k pairs of images in training set and 10k pairs of images in test set. Our methods achieve an average error of 10.9mm on our test set, outperforming previous work by 5.9mm (relatively 35.1%).

CVMar 20, 2018
An Improved Evaluation Framework for Generative Adversarial Networks

Shaohui Liu, Yi Wei, Jiwen Lu et al.

In this paper, we propose an improved quantitative evaluation framework for Generative Adversarial Networks (GANs) on generating domain-specific images, where we improve conventional evaluation methods on two levels: the feature representation and the evaluation metric. Unlike most existing evaluation frameworks which transfer the representation of ImageNet inception model to map images onto the feature space, our framework uses a specialized encoder to acquire fine-grained domain-specific representation. Moreover, for datasets with multiple classes, we propose Class-Aware Frechet Distance (CAFD), which employs a Gaussian mixture model on the feature space to better fit the multi-manifold feature distribution. Experiments and analysis on both the feature level and the image level were conducted to demonstrate improvements of our proposed framework over the recently proposed state-of-the-art FID method. To our best knowledge, we are the first to provide counter examples where FID gives inconsistent results with human judgments. It is shown in the experiments that our framework is able to overcome the shortness of FID and improves robustness. Code will be made available.

SEMar 5, 2014
Automated Fixing of Programs with Contracts

Yu Pei, Carlo A. Furia, Martin Nordio et al.

This paper describes AutoFix, an automatic debugging technique that can fix faults in general-purpose software. To provide high-quality fix suggestions and to enable automation of the whole debugging process, AutoFix relies on the presence of simple specification elements in the form of contracts (such as pre- and postconditions). Using contracts enhances the precision of dynamic analysis techniques for fault detection and localization, and for validating fixes. The only required user input to the AutoFix supporting tool is then a faulty program annotated with contracts; the tool produces a collection of validated fixes for the fault ranked according to an estimate of their suitability. In an extensive experimental evaluation, we applied AutoFix to over 200 faults in four code bases of different maturity and quality (of implementation and of contracts). AutoFix successfully fixed 42% of the faults, producing, in the majority of cases, corrections of quality comparable to those competent programmers would write; the used computational resources were modest, with an average time per fix below 20 minutes on commodity hardware. These figures compare favorably to the state of the art in automated program fixing, and demonstrate that the AutoFix approach is successfully applicable to reduce the debugging burden in real-world scenarios.

SEAug 16, 2012
What Good Are Strong Specifications?

Nadia Polikarpova, Carlo A. Furia, Yu Pei et al.

Experience with lightweight formal methods suggests that programmers are willing to write specification if it brings tangible benefits to their usual development activities. This paper considers stronger specifications and studies whether they can be deployed as an incremental practice that brings additional benefits without being unacceptably expensive. We introduce a methodology that extends Design by Contract to write strong specifications of functional properties in the form of preconditions, postconditions, and invariants. The methodology aims at being palatable to developers who are not fluent in formal techniques but are comfortable with writing simple specifications. We evaluate the cost and the benefits of using strong specifications by applying the methodology to testing data structure implementations written in Eiffel and C#. In our extensive experiments, testing against strong specifications detects twice as many bugs as standard contracts, with a reasonable overhead in terms of annotation burden and run-time performance while testing. In the wide spectrum of formal techniques for software quality, testing against strong specifications lies in a "sweet spot" with a favorable benefit to effort ratio.