CVDec 5, 2022Code
Exploring Stroke-Level Modifications for Scene Text EditingYadong Qu, Qingfeng Tan, Hongtao Xie et al.
Scene text editing (STE) aims to replace text with the desired one while preserving background and styles of the original text. However, due to the complicated background textures and various text styles, existing methods fall short in generating clear and legible edited text images. In this study, we attribute the poor editing performance to two problems: 1) Implicit decoupling structure. Previous methods of editing the whole image have to learn different translation rules of background and text regions simultaneously. 2) Domain gap. Due to the lack of edited real scene text images, the network can only be well trained on synthetic pairs and performs poorly on real-world images. To handle the above problems, we propose a novel network by MOdifying Scene Text image at strokE Level (MOSTEL). Firstly, we generate stroke guidance maps to explicitly indicate regions to be edited. Different from the implicit one by directly modifying all the pixels at image level, such explicit instructions filter out the distractions from background and guide the network to focus on editing rules of text regions. Secondly, we propose a Semi-supervised Hybrid Learning to train the network with both labeled synthetic images and unpaired real scene text images. Thus, the STE model is adapted to real-world datasets distributions. Moreover, two new datasets (Tamper-Syn2k and Tamper-Scene) are proposed to fill the blank of public evaluation datasets. Extensive experiments demonstrate that our MOSTEL outperforms previous methods both qualitatively and quantitatively. Datasets and code will be available at https://github.com/qqqyd/MOSTEL.
CVOct 8, 2023Code
Symmetrical Linguistic Feature Distillation with CLIP for Scene Text RecognitionZixiao Wang, Hongtao Xie, Yuxin Wang et al.
In this paper, we explore the potential of the Contrastive Language-Image Pretraining (CLIP) model in scene text recognition (STR), and establish a novel Symmetrical Linguistic Feature Distillation framework (named CLIP-OCR) to leverage both visual and linguistic knowledge in CLIP. Different from previous CLIP-based methods mainly considering feature generalization on visual encoding, we propose a symmetrical distillation strategy (SDS) that further captures the linguistic knowledge in the CLIP text encoder. By cascading the CLIP image encoder with the reversed CLIP text encoder, a symmetrical structure is built with an image-to-text feature flow that covers not only visual but also linguistic information for distillation.Benefiting from the natural alignment in CLIP, such guidance flow provides a progressive optimization objective from vision to language, which can supervise the STR feature forwarding process layer-by-layer.Besides, a new Linguistic Consistency Loss (LCL) is proposed to enhance the linguistic capability by considering second-order statistics during the optimization. Overall, CLIP-OCR is the first to design a smooth transition between image and text for the STR task.Extensive experiments demonstrate the effectiveness of CLIP-OCR with 93.8% average accuracy on six popular STR benchmarks.Code will be available at https://github.com/wzx99/CLIPOCR.
CLJul 6, 2023
VisKoP: Visual Knowledge oriented Programming for Interactive Knowledge Base Question AnsweringZijun Yao, Yuanyong Chen, Xin Lv et al. · tsinghua
We present Visual Knowledge oriented Programming platform (VisKoP), a knowledge base question answering (KBQA) system that integrates human into the loop to edit and debug the knowledge base (KB) queries. VisKoP not only provides a neural program induction module, which converts natural language questions into knowledge oriented program language (KoPL), but also maps KoPL programs into graphical elements. KoPL programs can be edited with simple graphical operators, such as dragging to add knowledge operators and slot filling to designate operator arguments. Moreover, VisKoP provides auto-completion for its knowledge base schema and users can easily debug the KoPL program by checking its intermediate results. To facilitate the practical KBQA on a million-entity-level KB, we design a highly efficient KoPL execution engine for the back-end. Experiment results show that VisKoP is highly efficient and user interaction can fix a large portion of wrong KoPL programs to acquire the correct answer. The VisKoP online demo https://demoviskop.xlore.cn (Stable release of this paper) and https://viskop.xlore.cn (Beta release with new features), highly efficient KoPL engine https://pypi.org/project/kopl-engine, and screencast video https://youtu.be/zAbJtxFPTXo are now publicly available.
MLJan 23, 2023
Federated Sufficient Dimension Reduction Through High-Dimensional Sparse Sliced Inverse RegressionWenquan Cui, Yue Zhao, Jianjun Xu et al.
Federated learning has become a popular tool in the big data era nowadays. It trains a centralized model based on data from different clients while keeping data decentralized. In this paper, we propose a federated sparse sliced inverse regression algorithm for the first time. Our method can simultaneously estimate the central dimension reduction subspace and perform variable selection in a federated setting. We transform this federated high-dimensional sparse sliced inverse regression problem into a convex optimization problem by constructing the covariance matrix safely and losslessly. We then use a linearized alternating direction method of multipliers algorithm to estimate the central subspace. We also give approaches of Bayesian information criterion and hold-out validation to ascertain the dimension of the central subspace and the hyper-parameter of the algorithm. We establish an upper bound of the statistical error rate of our estimator under the heterogeneous setting. We demonstrate the effectiveness of our method through simulations and real world applications.
AIMay 21
MPDocBench-Parse: Benchmarking Practical Multi-page Document ParsingBangbang Zhou, Hangdi Xing, Yifan Chen et al.
Document parsing converts visually rich documents into machine-readable structured representations, forming a crucial foundation for information systems. Although many benchmarks have been proposed for document parsing, they remain inadequate for realistic scenarios. Existing benchmarks either focus on specific tasks or assess only single-page, text-centric settings, making them insufficient for practical multi-page parsing. Moreover, they lack fine-grained evaluation of semantic continuity, hierarchical structure recovery, and visual content preservation. To address these gaps, we propose MPDocBench-Parse, a benchmark for multi-page document parsing in real-world applications. It contains 433 manually annotated documents with 3,246 pages, covering 15 document types in English and Chinese, with diverse layout styles, and supports document-level end-to-end evaluation. We further design a comprehensive protocol for content fidelity and logical structure, covering text, table, and formula recognition, truncated text and table merging, figure extraction, reading order, and heading hierarchy recovery. Experiments show that, while existing models perform well on basic text extraction, they still suffer clear limitations in semantic continuity integration, visual content parsing, and hierarchical structure recovery. MPDocBench-Parse provides a unified foundation for advancing document parsing toward more realistic scenarios.
SEMay 17
Debug Like a Human: Scaling LLM-based Fault Localization to Processor Design via Block-Level Instruction-Oriented SlicingZizhen Liu, Xiaoguang Mao, Deheng Yang et al.
Fault localization in modern processor design code is a critical yet time-consuming step during processor verification. While recent advances in LLM-based techniques for module-level hardware design have shown promising results, automatically localizing bugs in large-scale, project-level processor designs remains challenging. In this paper, we present BluesFL, a novel block-level LLM-based fault localization framework for processor designs. Inspired by the way engineers debug processors, we first propose a dataflow-based code blockization approach to guide LLMs to focus on critical local code context. We further propose a Block-Level Instruction-Oriented Slicing (Blues) algorithm that enables LLMs to mimic human reasoning by analyzing instruction execution paths and processor states. We evaluate BluesFL on a real-world RISC-V processor core comprising 19K lines of SystemVerilog code. Experimental results demonstrate that BluesFL correctly localizes 24 bugs at Top-1, achieving 242.9% improvement over the existing state-of-the-art (7 bugs). Cost analysis shows that BluesFL requires an average of only $0.257 to localize a single bug.
LGJul 2, 2025Code
Test-Time Scaling with Reflective Generative ModelZixiao Wang, Yuxin Wang, Xiaorui Wang et al.
We introduce our first reflective generative model MetaStone-S1, which obtains OpenAI o3-mini's performance via the new Reflective Generative Form. The new form focuses on high-quality reasoning trajectory selection and contains two novelties: 1) A unified interface for policy and process reward model: we share the backbone network and use task-specific heads for reasoning trajectory predicting and scoring respectively, introducing only 53M extra parameters for trajectory scoring. 2) Eliminating the reliance on process-level annotation: we provide a self-supervised process reward model, which can directly learn the high-quality reasoning trajectory selection from the outcome reward. Equipped with the reflective generative form, MetaStone-S1 is naturally suitable for test-time scaling, and we provide three reasoning effort modes (low, medium, and high) based on the controllable thinking length. Experiments demonstrate that our MetaStone-S1 achieves comparable performance to OpenAI o3-mini's series with only 32B parameter size. To support the research community, we have open-sourced MetaStone-S1 at https://github.com/MetaStone-AI/MetaStone-S1.
CLDec 12, 2024Code
GRIP: A Graph-Based Reasoning Instruction ProducerJiankang Wang, Jianjun Xu, Xiaorui Wang et al.
Large-scale, high-quality data is essential for advancing the reasoning capabilities of large language models (LLMs). As publicly available Internet data becomes increasingly scarce, synthetic data has emerged as a crucial research direction. However, existing data synthesis methods often suffer from limited scalability, insufficient sample diversity, and a tendency to overfit to seed data, which constrains their practical utility. In this paper, we present \textit{\textbf{GRIP}}, a \textbf{G}raph-based \textbf{R}easoning \textbf{I}nstruction \textbf{P}roducer that efficiently synthesizes high-quality and diverse reasoning instructions. \textit{GRIP} constructs a knowledge graph by extracting high-level concepts from seed data, and uniquely leverages both explicit and implicit relationships within the graph to drive large-scale and diverse instruction data synthesis, while employing open-source multi-model supervision to ensure data quality. We apply \textit{GRIP} to the critical and challenging domain of mathematical reasoning. Starting from a seed set of 7.5K math reasoning samples, we construct \textbf{GRIP-MATH}, a dataset containing 2.1 million synthesized question-answer pairs. Compared to similar synthetic data methods, \textit{GRIP} achieves greater scalability and diversity while also significantly reducing costs. On mathematical reasoning benchmarks, models trained with GRIP-MATH demonstrate substantial improvements over their base models and significantly outperform previous data synthesis methods.
MLFeb 28, 2023
Federated Covariate Shift Adaptation for Missing Target Output ValuesYaqian Xu, Wenquan Cui, Jianjun Xu et al.
The most recent multi-source covariate shift algorithm is an efficient hyperparameter optimization algorithm for missing target output. In this paper, we extend this algorithm to the framework of federated learning. For data islands in federated learning and covariate shift adaptation, we propose the federated domain adaptation estimate of the target risk which is asymptotically unbiased with a desirable asymptotic variance property. We construct a weighted model for the target task and propose the federated covariate shift adaptation algorithm which works preferably in our setting. The efficacy of our method is justified both theoretically and empirically.
CRApr 27
A Comparative Evaluation of AI Agent Security GuardrailsQi Li, Jiu Li, Pingtao Wei et al.
This report presents a comparative evaluation of DKnownAI Guard in AI agent security scenarios, benchmarked against three competing products: AWS Bedrock Guardrails, Azure Content Safety, and Lakera Guard. Using human annotation as the ground truth, we assess each guardrail's ability to detect two categories of risks: threats to the agent itself (e.g., instruction override, indirect injection, tool abuse) and requests intended to elicit harmful content (e.g., hate speech, pornography, violence). Evaluation results demonstrate that DKnownAI Guard achieves the highest recall rate at 96.5\% and ranks first in true negative rate (TNR) at 90.4\%, delivering the best overall performance among all evaluated guardrails.
CVMay 9, 2024Code
Self-Supervised Pre-training with Symmetric Superimposition Modeling for Scene Text RecognitionZuan Gao, Yuxin Wang, Yadong Qu et al.
In text recognition, self-supervised pre-training emerges as a good solution to reduce dependence on expansive annotated real data. Previous studies primarily focus on local visual representation by leveraging mask image modeling or sequence contrastive learning. However, they omit modeling the linguistic information in text images, which is crucial for recognizing text. To simultaneously capture local character features and linguistic information in visual space, we propose Symmetric Superimposition Modeling (SSM). The objective of SSM is to reconstruct the direction-specific pixel and feature signals from the symmetrically superimposed input. Specifically, we add the original image with its inverted views to create the symmetrically superimposed inputs. At the pixel level, we reconstruct the original and inverted images to capture character shapes and texture-level linguistic context. At the feature level, we reconstruct the feature of the same original image and inverted image with different augmentations to model the semantic-level linguistic context and the local character discrimination. In our design, we disrupt the character shape and linguistic rules. Consequently, the dual-level reconstruction facilitates understanding character shapes and linguistic information from the perspective of visual texture and feature semantics. Experiments on various text recognition benchmarks demonstrate the effectiveness and generality of SSM, with 4.1% average performance gains and 86.6% new state-of-the-art average word accuracy on Union14M benchmarks. The code is available at https://github.com/FaltingsA/SSM.
CVMay 9, 2023Code
Linguistic More: Taking a Further Step toward Efficient and Accurate Scene Text RecognitionBoqiang Zhang, Hongtao Xie, Yuxin Wang et al.
Vision model have gained increasing attention due to their simplicity and efficiency in Scene Text Recognition (STR) task. However, due to lacking the perception of linguistic knowledge and information, recent vision models suffer from two problems: (1) the pure vision-based query results in attention drift, which usually causes poor recognition and is summarized as linguistic insensitive drift (LID) problem in this paper. (2) the visual feature is suboptimal for the recognition in some vision-missing cases (e.g. occlusion, etc.). To address these issues, we propose a $\textbf{L}$inguistic $\textbf{P}$erception $\textbf{V}$ision model (LPV), which explores the linguistic capability of vision model for accurate text recognition. To alleviate the LID problem, we introduce a Cascade Position Attention (CPA) mechanism that obtains high-quality and accurate attention maps through step-wise optimization and linguistic information mining. Furthermore, a Global Linguistic Reconstruction Module (GLRM) is proposed to improve the representation of visual features by perceiving the linguistic information in the visual space, which gradually converts visual features into semantically rich ones during the cascade process. Different from previous methods, our method obtains SOTA results while keeping low complexity (92.4% accuracy with only 8.11M parameters). Code is available at https://github.com/CyrilSterling/LPV.
AINov 5, 2025
DeepKnown-Guard: A Proprietary Model-Based Safety Response Framework for AI AgentsQi Li, Jianjun Xu, Pingtao Wei et al.
With the widespread application of Large Language Models (LLMs), their associated security issues have become increasingly prominent, severely constraining their trustworthy deployment in critical domains. This paper proposes a novel safety response framework designed to systematically safeguard LLMs at both the input and output levels. At the input level, the framework employs a supervised fine-tuning-based safety classification model. Through a fine-grained four-tier taxonomy (Safe, Unsafe, Conditionally Safe, Focused Attention), it performs precise risk identification and differentiated handling of user queries, significantly enhancing risk coverage and business scenario adaptability, and achieving a risk recall rate of 99.3%. At the output level, the framework integrates Retrieval-Augmented Generation (RAG) with a specifically fine-tuned interpretation model, ensuring all responses are grounded in a real-time, trustworthy knowledge base. This approach eliminates information fabrication and enables result traceability. Experimental results demonstrate that our proposed safety control model achieves a significantly higher safety score on public safety evaluation benchmarks compared to the baseline model, TinyR1-Safety-8B. Furthermore, on our proprietary high-risk test set, the framework's components attained a perfect 100% safety score, validating their exceptional protective capabilities in complex risk scenarios. This research provides an effective engineering pathway for building high-security, high-trust LLM applications.
MLMay 18, 2023
Functional sufficient dimension reduction through information maximization with application to classificationXinyu Li, Jianjun Xu, Wenquan Cui et al.
Considering the case where the response variable is a categorical variable and the predictor is a random function, two novel functional sufficient dimensional reduction (FSDR) methods are proposed based on mutual information and square loss mutual information. Compared to the classical FSDR methods, such as functional sliced inverse regression and functional sliced average variance estimation, the proposed methods are appealing because they are capable of estimating multiple effective dimension reduction directions in the case of a relatively small number of categories, especially for the binary response. Moreover, the proposed methods do not require the restrictive linear conditional mean assumption and the constant covariance assumption. They avoid the inverse problem of the covariance operator which is often encountered in the functional sufficient dimension reduction. The functional principal component analysis with truncation be used as a regularization mechanism. Under some mild conditions, the statistical consistency of the proposed methods is established. It is demonstrated that the two methods are competitive compared with some existing FSDR methods by simulations and real data analyses.
CLFeb 25, 2019
Pretraining-Based Natural Language Generation for Text SummarizationHaoyu Zhang, Jianjun Xu, Ji Wang
In this paper, we propose a novel pretraining-based encoder-decoder framework, which can generate the output sequence based on the input sequence in a two-stage manner. For the encoder of our model, we encode the input sequence into context representations using BERT. For the decoder, there are two stages in our model, in the first stage, we use a Transformer-based decoder to generate a draft output sequence. In the second stage, we mask each word of the draft sequence and feed it to BERT, then by combining the input sequence and the draft representation generated by BERT, we use a Transformer-based decoder to predict the refined word for each masked position. To the best of our knowledge, our approach is the first method which applies the BERT into text generation tasks. As the first step in this direction, we evaluate our proposed method on the text summarization task. Experimental results show that our model achieves new state-of-the-art on both CNN/Daily Mail and New York Times datasets.
HCSep 28, 2015
Recognition of Brain Waves of Left and Right Hand Movement Imagery with Portable ElectroencephalographsZhen Li, Jianjun Xu, Tingshao Zhu
With the development of the modern society, mind control applied to both the recovery of disabled individuals and auxiliary control of normal people has obtained great attention in numerous researches. In our research, we attempt to recognize the brain waves of left and right hand movement imagery with portable electroencephalographs. Considering the inconvenience of wearing traditional multiple-electrode electroencephalographs, we choose Muse to collect data which is a portable headband launched lately with a number of useful functions and channels and it is much easier for the public to use. Additionally, previous researches generally focused on discrimination of EEG of left and right hand movement imagery by using data from C3 and C4 electrodes which locate on the top of the head. However, we choose the gamma wave channels of F7 and F8 and obtain data when subjects imagine their left or right hand to move with their eyeballs rotated in the corresponding direction. With the help of the Common Space Pattern algorithm to extract features of brain waves between left and right hand movement imagery, we make use of the Support Vector Machine to classify different brain waves. Traditionally, the accuracy rate of classification was approximately 90% using the EEG data from C3 and C4 electrode poles; however, the accuracy rate reaches 95.1% by using the gamma wave data from F7 and F8 in our experiment. Finally, we design a plane program in Python where a plane can be controlled to go left or right when users imagine their left or right hand to move. 8 subjects are tested and all of them can control the plane flexibly which reveals that our model can be applied to control hardware which is useful for disabled individuals and normal people.
HCSep 25, 2015
Prediction of Brain States of Concentration and Relaxation in Real Time with Portable ElectroencephalographsZhen Li, Jianjun Xu, Tingshao Zhu
In our research, we attempt to help people recognize their brain state of concentration or relaxation more conveniently and in real time. Considering the inconvenience of wearing traditional multiple electrode electroencephalographs, we choose Muse to collect data which is a portable headband launched lately with a number of useful functions and channels and it is much easier for the public to use. Besides, traditional online analysis did not focus on the synchronism between users and computers and the time delay problem did exist. To solve the problem, by building the Analytic Hierarchy Model, we choose the two gamma wave channels of F7 and F8 as the data source instead of using both beta and alpha channels traditionally.Using the Common Space Pattern algorithm and the Support Vector Machine model, the channels we choose have a higher recognition accuracy rate and smaller amount of data to be dealt with by the computer than the traditional ones. Furthermore, we make use of the Feedforward Neural Network Model to predict subjects'brain states in half a second. Finally, we design a plane program in Python where a plane can be controlled to go up or down when users concentrate or relax. The SVM model and the Feedforward Neural Network model have both been tested by 12 subjects and they give an evaluation ranging from 1 to 10 points. The former gets 7.58 points while the latter gets 8.83, which proves that the time delay problem is improved once more.