CLJun 8, 2023
Assessing Phrase Break of ESL Speech with Pre-trained Language Models and Large Language ModelsZhiyi Wang, Shaoguang Mao, Wenshan Wu et al.
This work introduces approaches to assessing phrase breaks in ESL learners' speech using pre-trained language models (PLMs) and large language models (LLMs). There are two tasks: overall assessment of phrase break for a speech clip and fine-grained assessment of every possible phrase break position. To leverage NLP models, speech input is first force-aligned with texts, and then pre-processed into a token sequence, including words and phrase break information. To utilize PLMs, we propose a pre-training and fine-tuning pipeline with the processed tokens. This process includes pre-training with a replaced break token detection module and fine-tuning with text classification and sequence labeling. To employ LLMs, we design prompts for ChatGPT. The experiments show that with the PLMs, the dependence on labeled training data has been greatly reduced, and the performance has improved. Meanwhile, we verify that ChatGPT, a renowned LLM, has potential for further advancement in this area.
CVDec 4, 2025
Object Reconstruction under Occlusion with Generative Priors and Contact-induced ConstraintsMinghan Zhu, Zhiyi Wang, Qihang Sun et al.
Object geometry is key information for robot manipulation. Yet, object reconstruction is a challenging task because cameras only capture partial observations of objects, especially when occlusion occurs. In this paper, we leverage two extra sources of information to reduce the ambiguity of vision signals. First, generative models learn priors of the shapes of commonly seen objects, allowing us to make reasonable guesses of the unseen part of geometry. Second, contact information, which can be obtained from videos and physical interactions, provides sparse constraints on the boundary of the geometry. We combine the two sources of information through contact-guided 3D generation. The guidance formulation is inspired by drag-based editing in generative models. Experiments on synthetic and real-world data show that our approach improves the reconstruction compared to pure 3D generation and contact-based optimization.
CLOct 28, 2022
Assessing Phrase Break of ESL speech with Pre-trained Language ModelsZhiyi Wang, Shaoguang Mao, Wenshan Wu et al.
This work introduces an approach to assessing phrase break in ESL learners' speech with pre-trained language models (PLMs). Different with traditional methods, this proposal converts speech to token sequences, and then leverages the power of PLMs. There are two sub-tasks: overall assessment of phrase break for a speech clip; fine-grained assessment of every possible phrase break position. Speech input is first force-aligned with texts, then pre-processed to a token sequence, including words and associated phrase break information. The token sequence is then fed into the pre-training and fine-tuning pipeline. In pre-training, a replaced break token detection module is trained with token data where each token has a certain percentage chance to be randomly replaced. In fine-tuning, overall and fine-grained scoring are optimized with text classification and sequence labeling pipeline, respectively. With the introduction of PLMs, the dependence on labeled training data has been greatly reduced, and performance has improved.
LGSep 29, 2024
DropEdge not Foolproof: Effective Augmentation Method for Signed Graph Neural NetworksZeyu Zhang, Lu Li, Shuyan Wan et al.
The paper discusses signed graphs, which model friendly or antagonistic relationships using edges marked with positive or negative signs, focusing on the task of link sign prediction. While Signed Graph Neural Networks (SGNNs) have advanced, they face challenges like graph sparsity and unbalanced triangles. The authors propose using data augmentation (DA) techniques to address these issues, although many existing methods are not suitable for signed graphs due to a lack of side information. They highlight that the random DropEdge method, a rare DA approach applicable to signed graphs, does not enhance link sign prediction performance. In response, they introduce the Signed Graph Augmentation (SGA) framework, which includes a structure augmentation module to identify candidate edges and a strategy for selecting beneficial candidates, ultimately improving SGNN training. Experimental results show that SGA significantly boosts the performance of SGNN models, with a notable 32.3% improvement in F1-micro for SGCN on the Slashdot dataset.
PLApr 24
QCP: A Practical Separation Logic-based C Program Verification ToolXiwei Wu, Yueyang Feng, Xiaoyang Lu et al.
As software systems increase in size and complexity dramatically, ensuring their correctness, security, and reliability becomes an increasingly formidable challenge. Despite significant advancements in verification techniques and tools, their practical application to complex, real-world systems is often hindered by critical gaps in both automation and expressiveness. To address these difficulties, this paper presents \textbf{Qualified C Programming Verifier (QCP)}, a novel verification tool that integrates annotation-based automatic verification with interactive proving using Rocq. QCP employs symbolic execution and a separation logic entailment solver to automatically discharge many verification obligations, while deferring more complex obligations to Rocq for manual proof. Furthermore, QCP includes a VS Code extension designed to enhance proof efficiency and support a deeper understanding of both the program behavior and verification outcomes.
CVMar 30
Adapting SAM to Nuclei Instance Segmentation and Classification via Cooperative Fine-Grained RefinementJingze Su, Tianle Zhu, Jiaxin Cai et al.
Nuclei instance segmentation is critical in computational pathology for cancer diagnosis and prognosis. Recently, the Segment Anything Model has demonstrated exceptional performance in various segmentation tasks, leveraging its rich priors and powerful global context modeling capabilities derived from large-scale pre-training on natural images. However, directly applying SAM to the medical imaging domain faces significant limitations: it lacks sufficient perception of the local structural features that are crucial for nuclei segmentation, and full fine-tuning for downstream tasks requires substantial computational costs. To efficiently transfer SAM's robust prior knowledge to nuclei instance segmentation while supplementing its task-aware local perception, we propose a parameter-efficient fine-tuning framework, named Cooperative Fine-Grained Refinement of SAM, consisting of three core components: 1) a Multi-scale Adaptive Local-aware Adapter, which enables effective capability transfer by augmenting the frozen SAM backbone with minimal parameters and instilling a powerful perception of local structures through dynamically generated, multi-scale convolutional kernels; 2) a Hierarchical Modulated Fusion Module, which dynamically aggregates multi-level encoder features to preserve fine-grained spatial details; and 3) a Boundary-Guided Mask Refinement, which integrates multi-context boundary cues with semantic features through explicit supervision, producing a boundary-focused signal to refine initial mask predictions for sharper delineation. These three components work cooperatively to enhance local perception, preserve spatial details, and refine boundaries, enabling SAM to perform accurate nuclei instance segmentation directly.
CVOct 4, 2021
A free lunch from ViT:Adaptive Attention Multi-scale Fusion Transformer for Fine-grained Visual RecognitionYuan Zhang, Jian Cao, Ling Zhang et al.
Learning subtle representation about object parts plays a vital role in fine-grained visual recognition (FGVR) field. The vision transformer (ViT) achieves promising results on computer vision due to its attention mechanism. Nonetheless, with the fixed size of patches in ViT, the class token in deep layer focuses on the global receptive field and cannot generate multi-granularity features for FGVR. To capture region attention without box annotations and compensate for ViT shortcomings in FGVR, we propose a novel method named Adaptive attention multi-scale Fusion Transformer (AFTrans). The Selective Attention Collection Module (SACM) in our approach leverages attention weights in ViT and filters them adaptively to correspond with the relative importance of input patches. The multiple scales (global and local) pipeline is supervised by our weights sharing encoder and can be easily trained end-to-end. Comprehensive experiments demonstrate that AFTrans can achieve SOTA performance on three published fine-grained benchmarks: CUB-200-2011, Stanford Dogs and iNat2017.