CLDec 3, 2022
The RoyalFlush System for the WMT 2022 Efficiency TaskBo Qin, Aixin Jia, Qiang Wang et al.
This paper describes the submission of the RoyalFlush neural machine translation system for the WMT 2022 translation efficiency task. Unlike the commonly used autoregressive translation system, we adopted a two-stage translation paradigm called Hybrid Regression Translation (HRT) to combine the advantages of autoregressive and non-autoregressive translation. Specifically, HRT first autoregressively generates a discontinuous sequence (e.g., make a prediction every $k$ tokens, $k>1$) and then fills in all previously skipped tokens at once in a non-autoregressive manner. Thus, we can easily trade off the translation quality and speed by adjusting $k$. In addition, by integrating other modeling techniques (e.g., sequence-level knowledge distillation and deep-encoder-shallow-decoder layer allocation strategy) and a mass of engineering efforts, HRT improves 80\% inference speed and achieves equivalent translation performance with the same-capacity AT counterpart. Our fastest system reaches 6k+ words/second on the GPU latency setting, estimated to be about 3.1x faster than the last year's winner.
LGJun 26, 2023
Hard Sample Mining Enabled Supervised Contrastive Feature Learning for Wind Turbine Pitch System Fault DiagnosisZixuan Wang, Bo Qin, Mengxuan Li et al.
The efficient utilization of wind power by wind turbines relies on the ability of their pitch systems to adjust blade pitch angles in response to varying wind speeds. However, the presence of multiple health conditions in the pitch system due to the long-term wear and tear poses challenges in accurately classifying them, thus increasing the maintenance cost of wind turbines or even damaging them. This paper proposes a novel method based on hard sample mining-enabled supervised contrastive learning (HSMSCL) to address this problem. The proposed method employs cosine similarity to identify hard samples and subsequently, leverages supervised contrastive learning to learn more discriminative representations by constructing hard sample pairs. Furthermore, the hard sample mining framework in the proposed method also constructs hard samples with learned representations to make the training process of the multilayer perceptron (MLP) more challenging and make it a more effective classifier. The proposed approach progressively improves the fault diagnosis model by introducing hard samples in the SCL and MLP phases, thus enhancing its performance in complex multi-class fault diagnosis tasks. To evaluate the effectiveness of the proposed method, two real datasets comprising wind turbine pitch system cog belt fracture data are utilized. The fault diagnosis performance of the proposed method is compared against existing methods, and the results demonstrate its superior performance. The proposed approach exhibits significant improvements in fault diagnosis performance, providing promising prospects for enhancing the reliability and efficiency of wind turbine pitch system fault diagnosis.
LGOct 20, 2023
Weighted Joint Maximum Mean Discrepancy Enabled Multi-Source-Multi-Target Unsupervised Domain Adaptation Fault DiagnosisZixuan Wang, Haoran Tang, Haibo Wang et al.
Despite the remarkable results that can be achieved by data-driven intelligent fault diagnosis techniques, they presuppose the same distribution of training and test data as well as sufficient labeled data. Various operating states often exist in practical scenarios, leading to the problem of domain shift that hinders the effectiveness of fault diagnosis. While recent unsupervised domain adaptation methods enable cross-domain fault diagnosis, they struggle to effectively utilize information from multiple source domains and achieve effective diagnosis faults in multiple target domains simultaneously. In this paper, we innovatively proposed a weighted joint maximum mean discrepancy enabled multi-source-multi-target unsupervised domain adaptation (WJMMD-MDA), which realizes domain adaptation under multi-source-multi-target scenarios in the field of fault diagnosis for the first time. The proposed method extracts sufficient information from multiple labeled source domains and achieves domain alignment between source and target domains through an improved weighted distance loss. As a result, domain-invariant and discriminative features between multiple source and target domains are learned with cross-domain fault diagnosis realized. The performance of the proposed method is evaluated in comprehensive comparative experiments on three datasets, and the experimental results demonstrate the superiority of this method.
CVDec 7, 2021Code
Handwritten Mathematical Expression Recognition via Attention Aggregation based Bi-directional Mutual LearningXiaohang Bian, Bo Qin, Xiaozhe Xin et al.
Handwritten mathematical expression recognition aims to automatically generate LaTeX sequences from given images. Currently, attention-based encoder-decoder models are widely used in this task. They typically generate target sequences in a left-to-right (L2R) manner, leaving the right-to-left (R2L) contexts unexploited. In this paper, we propose an Attention aggregation based Bi-directional Mutual learning Network (ABM) which consists of one shared encoder and two parallel inverse decoders (L2R and R2L). The two decoders are enhanced via mutual distillation, which involves one-to-one knowledge transfer at each training step, making full use of the complementary information from two inverse directions. Moreover, in order to deal with mathematical symbols in diverse scales, an Attention Aggregation Module (AAM) is proposed to effectively integrate multi-scale coverage attentions. Notably, in the inference phase, given that the model already learns knowledge from two inverse directions, we only use the L2R branch for inference, keeping the original parameter size and inference speed. Extensive experiments demonstrate that our proposed approach achieves the recognition accuracy of 56.85 % on CROHME 2014, 52.92 % on CROHME 2016, and 53.96 % on CROHME 2019 without data augmentation and model ensembling, substantially outperforming the state-of-the-art methods. The source code is available in https://github.com/XH-B/ABM.
HCOct 28, 2024
AutoGLM: Autonomous Foundation Agents for GUIsXiao Liu, Bo Qin, Dongzhu Liang et al. · tsinghua
We present AutoGLM, a new series in the ChatGLM family, designed to serve as foundation agents for autonomous control of digital devices through Graphical User Interfaces (GUIs). While foundation models excel at acquiring human knowledge, they often struggle with decision-making in dynamic real-world environments, limiting their progress toward artificial general intelligence. This limitation underscores the importance of developing foundation agents capable of learning through autonomous environmental interactions by reinforcing existing models. Focusing on Web Browser and Phone as representative GUI scenarios, we have developed AutoGLM as a practical foundation agent system for real-world GUI interactions. Our approach integrates a comprehensive suite of techniques and infrastructures to create deployable agent systems suitable for user delivery. Through this development, we have derived two key insights: First, the design of an appropriate "intermediate interface" for GUI control is crucial, enabling the separation of planning and grounding behaviors, which require distinct optimization for flexibility and accuracy respectively. Second, we have developed a novel progressive training framework that enables self-evolving online curriculum reinforcement learning for AutoGLM. Our evaluations demonstrate AutoGLM's effectiveness across multiple domains. For web browsing, AutoGLM achieves a 55.2% success rate on VAB-WebArena-Lite (improving to 59.1% with a second attempt) and 96.2% on OpenTable evaluation tasks. In Android device control, AutoGLM attains a 36.2% success rate on AndroidLab (VAB-Mobile) and 89.7% on common tasks in popular Chinese APPs.
CVDec 23, 2025
VALLR-Pin: Dual-Decoding Visual Speech Recognition for Mandarin with Pinyin-Guided LLM RefinementChang Sun, Dongliang Xie, Bo Qin et al.
Visual Speech Recognition aims to transcribe spoken words from silent lip-motion videos. This task is particularly challenging for Mandarin, as visemes are highly ambiguous and homophones are prevalent. We propose VALLR-Pin, a novel two-stage framework that extends the recent VALLR architecture from English to Mandarin. First, a shared video encoder feeds into dual decoders, which jointly predict both Chinese character sequences and their standard Pinyin romanization. The multi-task learning of character and phonetic outputs fosters robust visual-semantic representations. During inference, the text decoder generates multiple candidate transcripts. We construct a prompt by concatenating the Pinyin output with these candidate Chinese sequences and feed it to a large language model to resolve ambiguities and refine the transcription. This provides the LLM with explicit phonetic context to correct homophone-induced errors. Finally, we fine-tune the LLM on synthetic noisy examples: we generate imperfect Pinyin-text pairs from intermediate VALLR-Pin checkpoints using the training data, creating instruction-response pairs for error correction. This endows the LLM with awareness of our model's specific error patterns. In summary, VALLR-Pin synergizes visual features with phonetic and linguistic context to improve Mandarin lip-reading performance.
CVMar 4, 2024
JEP-KD: Joint-Embedding Predictive Architecture Based Knowledge Distillation for Visual Speech RecognitionChang Sun, Hong Yang, Bo Qin
Visual Speech Recognition (VSR) tasks are generally recognized to have a lower theoretical performance ceiling than Automatic Speech Recognition (ASR), owing to the inherent limitations of conveying semantic information visually. To mitigate this challenge, this paper introduces an advanced knowledge distillation approach using a Joint-Embedding Predictive Architecture (JEPA), named JEP-KD, designed to more effectively utilize audio features during model training. Central to JEP-KD is the inclusion of a generative network within the embedding layer, which enhances the video encoder's capacity for semantic feature extraction and brings it into closer alignment with the audio features from a pre-trained ASR model's encoder. This approach aims to progressively reduce the performance gap between VSR and ASR. Moreover, a comprehensive multimodal, multistage training regimen for the JEP-KD framework is established, bolstering the robustness and efficacy of the training process. Experiment results demonstrate that JEP-KD significantly improves the performance of VSR models and demonstrates versatility across different VSR platforms, indicating its potential for broader application within other multimodal tasks.
LGDec 14, 2025
Skillful Subseasonal-to-Seasonal Forecasting of Extreme Events with a Multi-Sphere Coupled Probabilistic ModelBin Mu, Yuxuan Chen, Shijin Yuan et al.
Accurate subseasonal-to-seasonal (S2S) prediction of extreme events is critical for resource planning and disaster mitigation under accelerating climate change. However, such predictions remain challenging due to complex multi-sphere interactions and intrinsic atmospheric uncertainty. Here we present TianXing-S2S, a multi-sphere coupled probabilistic model for global S2S daily ensemble forecast. TianXing-S2S first encodes diverse multi-sphere predictors into a compact latent space, then employs a diffusion model to generate daily ensemble forecasts. A novel coupling module based on optimal transport (OT) is incorporated in the denoiser to optimize the interactions between atmospheric and multi-sphere boundary conditions. Across key atmospheric variables, TianXing-S2S outperforms both the European Centre for Medium-Range Weather Forecasts (ECMWF) S2S system and FuXi-S2S in 45-day daily-mean ensemble forecasts at 1.5 resolution. Our model achieves skillful subseasonal prediction of extreme events including heat waves and anomalous precipitation, identifying soil moisture as a critical precursor signal. Furthermore, we demonstrate that TianXing-S2S can generate stable rollout forecasts up to 180 days, establishing a robust framework for S2S research in a warming world.
LGMar 30, 2022
Slow-varying Dynamics Assisted Temporal Capsule Network for Machinery Remaining Useful Life EstimationYan Qin, Chau Yuen, Yimin Shao et al.
Capsule network (CapsNet) acts as a promising alternative to the typical convolutional neural network, which is the dominant network to develop the remaining useful life (RUL) estimation models for mechanical equipment. Although CapsNet comes with an impressive ability to represent the entities' hierarchical relationships through a high-dimensional vector embedding, it fails to capture the long-term temporal correlation of run-to-failure time series measured from degraded mechanical equipment. On the other hand, the slow-varying dynamics, which reveals the low-frequency information hidden in mechanical dynamical behaviour, is overlooked in the existing RUL estimation models, limiting the utmost ability of advanced networks. To address the aforementioned concerns, we propose a Slow-varying Dynamics assisted Temporal CapsNet (SD-TemCapsNet) to simultaneously learn the slow-varying dynamics and temporal dynamics from measurements for accurate RUL estimation. First, in light of the sensitivity of fault evolution, slow-varying features are decomposed from normal raw data to convey the low-frequency components corresponding to the system dynamics. Next, the long short-term memory (LSTM) mechanism is introduced into CapsNet to capture the temporal correlation of time series. To this end, experiments conducted on an aircraft engine and a milling machine verify that the proposed SD-TemCapsNet outperforms the mainstream methods. In comparison with CapsNet, the estimation accuracy of the aircraft engine with four different scenarios has been improved by 10.17%, 24.97%, 3.25%, and 13.03% concerning the index root mean squared error, respectively. Similarly, the estimation accuracy of the milling machine has been improved by 23.57% compared to LSTM and 19.54% compared to CapsNet.
CRDec 21, 2015
Flexible Attribute-Based Encryption Applicable to Secure E-Healthcare RecordsBo Qin, Hua Deng, Qianhong Wu et al.
In e-healthcare record systems (EHRS), attribute-based encryption (ABE) appears as a natural way to achieve fine-grained access control on health records. Some proposals exploit key-policy ABE (KP-ABE) to protect privacy in such a way that all users are associated with specific access policies and only the ciphertexts matching the users' access policies can be decrypted. An issue with KP-ABE is that it requires an a priori formulation of access policies during key generation, which is not always practicable in EHRS because the policies to access health records are sometimes determined after key generation. In this paper, we revisit KPABE and propose a dynamic ABE paradigm, referred to as access policy redefinable ABE (APR-ABE). To address the above issue, APR-ABE allows users to redefine their access policies and delegate keys for the redefined ones; hence a priori precise policies are no longer mandatory. We construct an APR-ABE scheme with short ciphertexts and prove its full security in the standard model under several static assumptions.
CRAug 7, 2015
On the Security of Privacy-Preserving Vehicular Communication Authentication with Hierarchical Aggregation and Fast ResponseLei Zhang, Chuanyan Hu, Qianhong Wu et al.
In [3], the authors proposed a highly efficient secure and privacy-preserving scheme for secure vehicular communications. The proposed scheme consists of four protocols: system setup, protocol for STP and STK distribution, protocol for common string synchronization, and protocol for vehicular communications. Here we define the security models for the protocol for STP and STK distribution, and the protocol for vehicular communications,respectively. We then prove that these two protocols are secure in our models.
CRJun 29, 2015
On the Security of MTA-OTIBASs (Multiple-TA One-Time Identity-Based Aggregate Signatures)Lei Zhang, Qianhong Wu, Josep Domingo-Ferrer et al.
In [3] the authors proposed a new aggregate signature scheme referred to as multiple-TA (trusted authority) one-time identity-based aggregate signature (MTA-OTIBAS). Further, they gave a concrete MTA-OTIBAS scheme. We recall here the definition of MTA-OTIBAS and the concrete proposed scheme. Then we prove that our MTA-OTIBAS concrete scheme is existentially unforgeable against adaptively chosen-message attacks in the random oracle model under the co-CDH problem assumption.