CVMay 11, 2022
NTIRE 2022 Challenge on Efficient Super-Resolution: Methods and ResultsYawei Li, Kai Zhang, Radu Timofte et al. · eth-zurich, tencent-ai
This paper reviews the NTIRE 2022 challenge on efficient single image super-resolution with focus on the proposed solutions and results. The task of the challenge was to super-resolve an input image with a magnification factor of $\times$4 based on pairs of low and corresponding high resolution images. The aim was to design a network for single image super-resolution that achieved improvement of efficiency measured according to several metrics including runtime, parameters, FLOPs, activations, and memory consumption while at least maintaining the PSNR of 29.00dB on DIV2K validation set. IMDN is set as the baseline for efficiency measurement. The challenge had 3 tracks including the main track (runtime), sub-track one (model complexity), and sub-track two (overall performance). In the main track, the practical runtime performance of the submissions was evaluated. The rank of the teams were determined directly by the absolute value of the average runtime on the validation set and test set. In sub-track one, the number of parameters and FLOPs were considered. And the individual rankings of the two metrics were summed up to determine a final ranking in this track. In sub-track two, all of the five metrics mentioned in the description of the challenge including runtime, parameter count, FLOPs, activations, and memory consumption were considered. Similar to sub-track one, the rankings of five metrics were summed up to determine a final ranking. The challenge had 303 registered participants, and 43 teams made valid submissions. They gauge the state-of-the-art in efficient single image super-resolution.
81.2CVMay 21Code
AgroTools: A Benchmark for Tool-Augmented Multimodal Agents in AgricultureZi Ye, Yibin Wen, Xiaoya Fan et al.
Agricultural decision-making increasingly requires multimodal systems that can transform visual observations into reliable, executable actions. However, existing agricultural multimodal benchmarks mainly evaluate final-answer correctness and provide limited support for assessing whether models can use external tools to complete precision-sensitive workflows. In this paper, we introduce AgroTools, a benchmark for evaluating tool-augmented multimodal agents in agriculture. AgroTools contains 539 question-answer instances paired with 1,097 heterogeneous agricultural images, spanning five task families and an executable environment of 14 agricultural tools. Each query is annotated with structured tool-use traces, enabling a dual-view evaluation of both process-level execution quality and outcome-level task success. We benchmark 9 open-source and 4 closed-source multimodal large language models on AgroTools. Results show that current models remain far from reliable in agricultural tool-use settings, with clear bottlenecks in tool planning, argument generation, execution recovery, and final-answer synthesis. We hope AgroTools will support future research on multimodal agents for high-precision agricultural applications. The benchmark and evaluation are available at https://huggingface.co/datasets/AgroTools/AgroTools.
31.9CLMay 16Code
Exploring Lightweight Large Language Models for Court View GenerationZhitian Hou, Tianyong Hao, Nanli Zeng et al.
Criminal Court View Generation (CVG) is a critical task in Legal Artificial Intelligence (Legal AI), involving the generation of court view based on case facts. In this work, we systematically explore the capabilities of lightweight (smaller than 2B) large language models (LLMs) in CVG and their impact on charge prediction. Our study addresses four key questions: (1) how does different architecture of LLMs affect the CVG quality and charge prediction. (2) how does LLMs size contribute to the performance, (3) how do lightweight LLMs compare with Deep Neural Networks (DNNs) in these tasks, and (4) how does predicting charge by court view generation first compare with predicting it directly. Additionally, we also develop CVGEvalKit, an evaluation framework including three public available datasets for CVG tasks, as well as predicting their charges. Comprehensive experiments are conducted on this framework, where models are trained on a mixed training set and evaluated on each dataset's test set. Experimental results provide new insights into the trade-offs between model architecture, model size, and the influence between different tasks, highlighting the potential of lightweight LLMs in judicial AI applications. The source code is anonymously available at \url{https://github.com/ZhitianHou/CVGEvalKit}
CVNov 9, 2023
FireMatch: A Semi-Supervised Video Fire Detection Network Based on Consistency and Distribution AlignmentQinghua Lin, Zuoyong Li, Kun Zeng et al.
Deep learning techniques have greatly enhanced the performance of fire detection in videos. However, video-based fire detection models heavily rely on labeled data, and the process of data labeling is particularly costly and time-consuming, especially when dealing with videos. Considering the limited quantity of labeled video data, we propose a semi-supervised fire detection model called FireMatch, which is based on consistency regularization and adversarial distribution alignment. Specifically, we first combine consistency regularization with pseudo-label. For unlabeled data, we design video data augmentation to obtain corresponding weakly augmented and strongly augmented samples. The proposed model predicts weakly augmented samples and retains pseudo-label above a threshold, while training on strongly augmented samples to predict these pseudo-labels for learning more robust feature representations. Secondly, we generate video cross-set augmented samples by adversarial distribution alignment to expand the training data and alleviate the decline in classification performance caused by insufficient labeled data. Finally, we introduce a fairness loss to help the model produce diverse predictions for input samples, thereby addressing the issue of high confidence with the non-fire class in fire classification scenarios. The FireMatch achieved an accuracy of 76.92% and 91.81% on two real-world fire datasets, respectively. The experimental results demonstrate that the proposed method outperforms the current state-of-the-art semi-supervised classification methods.
83.6SEApr 17
RepoShapley: Shapley-Enhanced Context Filtering for Repository-Level Code CompletionYu Huo, Kun Zeng, Siyu Zhang et al.
Repository-level code completion benefits from retrieval-augmented generation (RAG). However, controlling cross-file evidence is difficult because chunk utility is often interaction-dependent: some snippets help only when paired with complementary context, while others harm decoding when they conflict. We propose RepoShapley, a coalition-aware context filtering framework supervised by Shapley-style marginal contributions. Our offline labeling module, ChunkShapley, estimates signed per-chunk effects via teacher-forced probing, feeds them into a lightweight surrogate game that captures saturation and interference, computes exact Shapley values for small retrieval sets, and selects a decoding-optimal coalition through bounded post-verification with the frozen generator. The verified keep/drop decisions and retrieval triggers are then distilled into a single model via discrete control tokens. Experiments across benchmarks and backbones show that RepoShapley improves completion quality while reducing harmful context and unnecessary retrieval.
CLSep 12, 2025Code
Large Language Models Meet Legal Artificial Intelligence: A SurveyZhitian Hou, Zihan Ye, Nanli Zeng et al.
Large Language Models (LLMs) have significantly advanced the development of Legal Artificial Intelligence (Legal AI) in recent years, enhancing the efficiency and accuracy of legal tasks. To advance research and applications of LLM-based approaches in legal domain, this paper provides a comprehensive review of 16 legal LLMs series and 47 LLM-based frameworks for legal tasks, and also gather 15 benchmarks and 29 datasets to evaluate different legal capabilities. Additionally, we analyse the challenges and discuss future directions for LLM-based approaches in the legal domain. We hope this paper provides a systematic introduction for beginners and encourages future research in this field. Resources are available at https://github.com/ZhitianHou/LLMs4LegalAI.
CVJan 28Code
Shape of Thought: Progressive Object Assembly via Visual Chain-of-ThoughtYu Huo, Siyu Zhang, Kun Zeng et al.
Multimodal models for text-to-image generation have achieved strong visual fidelity, yet they remain brittle under compositional structural constraints-notably generative numeracy, attribute binding, and part-level relations. To address these challenges, we propose Shape-of-Thought (SoT), a visual CoT framework that enables progressive shape assembly via coherent 2D projections without external engines at inference time. SoT trains a unified multimodal autoregressive model to generate interleaved textual plans and rendered intermediate states, helping the model capture shape-assembly logic without producing explicit geometric representations. To support this paradigm, we introduce SoT-26K, a large-scale dataset of grounded assembly traces derived from part-based CAD hierarchies, and T2S-CompBench, a benchmark for evaluating structural integrity and trace faithfulness. Fine-tuning on SoT-26K achieves 88.4% on component numeracy and 84.8% on structural topology, outperforming text-only baselines by around 20%. SoT establishes a new paradigm for transparent, process-supervised compositional generation. The code is available at https://anonymous.4open.science/r/16FE/. The SoT-26K dataset will be released upon acceptance.
CLOct 10, 2025Code
ShiZhi: A Chinese Lightweight Large Language Model for Court View GenerationZhitian Hou, Kun Zeng
Criminal Court View Generation (CVG) is a fundamental task in legal artificial intelligence, aiming to automatically generate the "Court View" section of a legal case document. Generating court views is challenging due to the diversity and complexity of case facts, and directly generating from raw facts may limit performance. In this paper, we present ShiZhi, the first large language model (LLM) specifically designed for court view generation. We construct a Chinese Court View Generation dataset, CCVG, of more than 110K cases, each containing fact descriptions paired with corresponding court views. Based on this dataset, ShiZhi achieving 70.00 ROUGE-1 and 67.85 BLEU-1 on court view generation, as well as 86.48\% accuracy with 92.75\% macro F1 on charge prediction. Experimental results demonstrate that even a small LLM can generate reasonable and legally coherent court views when trained on high-quality domain-specific data. Our model and dataset are available at \href{https://github.com/ZhitianHou/ShiZhi}{https://github.com/ZhitianHou/ShiZhi}.
LGJun 12, 2021Code
Scaling transition from momentum stochastic gradient descent to plain stochastic gradient descentKun Zeng, Jinlan Liu, Zhixia Jiang et al.
The plain stochastic gradient descent and momentum stochastic gradient descent have extremely wide applications in deep learning due to their simple settings and low computational complexity. The momentum stochastic gradient descent uses the accumulated gradient as the updated direction of the current parameters, which has a faster training speed. Because the direction of the plain stochastic gradient descent has not been corrected by the accumulated gradient. For the parameters that currently need to be updated, it is the optimal direction, and its update is more accurate. We combine the advantages of the momentum stochastic gradient descent with fast training speed and the plain stochastic gradient descent with high accuracy, and propose a scaling transition from momentum stochastic gradient descent to plain stochastic gradient descent(TSGD) method. At the same time, a learning rate that decreases linearly with the iterations is used instead of a constant learning rate. The TSGD algorithm has a larger step size in the early stage to speed up the training, and training with a smaller step size in the later stage can steadily converge. Our experimental results show that the TSGD algorithm has faster training speed, higher accuracy and better stability. Our implementation is available at: https://github.com/kunzeng/TSGD.
LGJun 12, 2021Code
A decreasing scaling transition scheme from Adam to SGDKun Zeng, Jinlan Liu, Zhixia Jiang et al.
Adaptive gradient algorithm (AdaGrad) and its variants, such as RMSProp, Adam, AMSGrad, etc, have been widely used in deep learning. Although these algorithms are faster in the early phase of training, their generalization performance is often not as good as stochastic gradient descent (SGD). Hence, a trade-off method of transforming Adam to SGD after a certain iteration to gain the merits of both algorithms is theoretically and practically significant. To that end, we propose a decreasing scaling transition scheme to achieve a smooth and stable transition from Adam to SGD, which is called DSTAdam. The convergence of the proposed DSTAdam is also proved in an online convex setting. Finally, the effectiveness of the DSTAdam is verified on the CIFAR-10/100 datasets. Our implementation is available at: https://github.com/kunzeng/DSTAdam.
52.7CLMay 7
Group of Skills: Group-Structured Skill Retrieval for Agent Skill LibrariesKun Zeng, Yu Huo, Siyu Zhang et al.
Skill-augmented agents increasingly rely on large reusable skill libraries, but retrieving relevant skills is not the same as presenting usable context. Existing methods typically return atomic skills or dependency-aware bundles whose internal roles remain implicit, leaving the agent to infer the execution entry point, support skills, visible requirements, and failure-avoidance guidance. We introduce Group of Skills (GoSkills), an inference-time group-structured retrieval method that changes the agent-facing retrieval object from a flat skill list to a compact, role-labeled execution context. GoSkills builds anchor-centered skill groups from a typed skill graph, expands support groups through a group graph, bottlenecks the selected group plan into a bounded set of atomic skill payloads, and renders a fixed execution contract with Start, Support, Check, and Avoid fields, without changing the downstream agent, skill payloads, or execution environment. Experiments on SkillsBench and ALFWorld show that GoSkills preserves visible-requirement coverage under a small skill budget, improves over flat skill-access baselines, and often improves reward and agent-only runtime relative to structural retrieval references.
CVFeb 26, 2022
Unsupervised Domain Adaptive Salient Object Detection Through Uncertainty-Aware Pseudo-Label LearningPengxiang Yan, Ziyi Wu, Mengmeng Liu et al.
Recent advances in deep learning significantly boost the performance of salient object detection (SOD) at the expense of labeling larger-scale per-pixel annotations. To relieve the burden of labor-intensive labeling, deep unsupervised SOD methods have been proposed to exploit noisy labels generated by handcrafted saliency methods. However, it is still difficult to learn accurate saliency details from rough noisy labels. In this paper, we propose to learn saliency from synthetic but clean labels, which naturally has higher pixel-labeling quality without the effort of manual annotations. Specifically, we first construct a novel synthetic SOD dataset by a simple copy-paste strategy. Considering the large appearance differences between the synthetic and real-world scenarios, directly training with synthetic data will lead to performance degradation on real-world scenarios. To mitigate this problem, we propose a novel unsupervised domain adaptive SOD method to adapt between these two domains by uncertainty-aware self-training. Experimental results show that our proposed method outperforms the existing state-of-the-art deep unsupervised SOD methods on several benchmark datasets, and is even comparable to fully-supervised ones.
IVMay 17, 2021
Real-Time Quantized Image Super-Resolution on Mobile NPUs, Mobile AI 2021 Challenge: ReportAndrey Ignatov, Radu Timofte, Maurizio Denna et al.
Image super-resolution is one of the most popular computer vision problems with many important applications to mobile devices. While many solutions have been proposed for this task, they are usually not optimized even for common smartphone AI hardware, not to mention more constrained smart TV platforms that are often supporting INT8 inference only. To address this problem, we introduce the first Mobile AI challenge, where the target is to develop an end-to-end deep learning-based image super-resolution solutions that can demonstrate a real-time performance on mobile or edge NPUs. For this, the participants were provided with the DIV2K dataset and trained quantized models to do an efficient 3X image upscaling. The runtime of all models was evaluated on the Synaptics VS680 Smart Home board with a dedicated NPU capable of accelerating quantized neural networks. The proposed solutions are fully compatible with all major mobile AI accelerators and are capable of reconstructing Full HD images under 40-60 ms while achieving high fidelity results. A detailed description of all models developed in the challenge is provided in this paper.
CVApr 15, 2018
SparseNet: A Sparse DenseNet for Image ClassificationWenqi Liu, Kun Zeng
Deep neural networks have made remarkable progresses on various computer vision tasks. Recent works have shown that depth, width and shortcut connections of networks are all vital to their performances. In this paper, we introduce a method to sparsify DenseNet which can reduce connections of a L-layer DenseNet from O(L^2) to O(L), and thus we can simultaneously increase depth, width and connections of neural networks in a more parameter-efficient and computation-efficient way. Moreover, an attention module is introduced to further boost our network's performance. We denote our network as SparseNet. We evaluate SparseNet on datasets of CIFAR(including CIFAR10 and CIFAR100) and SVHN. Experiments show that SparseNet can obtain improvements over the state-of-the-art on CIFAR10 and SVHN. Furthermore, while achieving comparable performances as DenseNet on these datasets, SparseNet is x2.6 smaller and x3.7 faster than the original DenseNet.