LiBo Wang

CV
h-index49
19papers
2,163citations
Novelty45%
AI Score51

19 Papers

LGNov 14, 2024Code
Reducing Reasoning Costs: The Path of Optimization for Chain of Thought via Sparse Attention Mechanism

Libo Wang

In order to address the chain of thought in the large language model inference cost surge, this research proposes to use a sparse attention mechanism that only focuses on a few relevant tokens. The researcher constructed a new attention mechanism and used GiantRabbit trained with custom GPTs as an experimental tool. The experiment tested and compared the reasoning time, correctness score and chain of thought length of this model and o1 Preview in solving the linear algebra test questions of MIT OpenCourseWare. The results show that GiantRabbit's reasoning time and chain of thought length are significantly lower than o1 Preview. It verifies the feasibility of sparse attention mechanism for optimizing chain of thought reasoning. Detailed architectural details and experimental process have been uploaded to Github, the link is:https://github.com/brucewang123456789/GeniusTrail.git.

LGNov 30, 2025Code
Sigma: The Key for Vision-Language-Action Models toward Telepathic Alignment

Libo Wang

To address the gap in humanoid robot cognitive systems regarding the lack of a time-updable mediating thought space between semantics and continuous control, this study constructs and trains a VLA model named "Sigma" that runs on a single RTX 4090. It uses the open-source pi05_base model as a foundation and preprocesses svla_so101_pickplace into a training dataset. The researcher independently designed an architecture for a vision-language-action model that combines deep semantic understanding and association to achieve telepathic communication. The training process involved repeated optimizations of data preprocessing, LoRA fine-tuning, and the inference-stage adapter. The experiment employed offline closed-loop replay, comparing Sigma with the untuned pure pi05_base_base model under data conditions. Results showed that Sigma exhibited a stable decrease in control MSE across vector, fragment, and entire trajectory timescales, while maintaining the telepathy norm and semantic-text alignment quality unchanged. It demonstrates that mind-responsive alignment control is quantified through an architecture that combines deep understanding of semantics and association without retraining the base model, which provides reproducible experience for semantic alignment and intention-driven behavior in humanoid robots.

CVSep 11, 2025Code
PeftCD: Leveraging Vision Foundation Models with Parameter-Efficient Fine-Tuning for Remote Sensing Change Detection

Sijun Dong, Yuxuan Hu, LiBo Wang et al.

To tackle the prevalence of pseudo changes, the scarcity of labeled samples, and the difficulty of cross-domain generalization in multi-temporal and multi-source remote sensing imagery, we propose PeftCD, a change detection framework built upon Vision Foundation Models (VFMs) with Parameter-Efficient Fine-Tuning (PEFT). At its core, PeftCD employs a weight-sharing Siamese encoder derived from a VFM, into which LoRA and Adapter modules are seamlessly integrated. This design enables highly efficient task adaptation by training only a minimal set of additional parameters. To fully unlock the potential of VFMs, we investigate two leading backbones: the Segment Anything Model v2 (SAM2), renowned for its strong segmentation priors, and DINOv3, a state-of-the-art self-supervised representation learner. The framework is complemented by a deliberately lightweight decoder, ensuring the focus remains on the powerful feature representations from the backbones. Extensive experiments demonstrate that PeftCD achieves state-of-the-art performance across multiple public datasets, including SYSU-CD (IoU 73.81%), WHUCD (92.05%), MSRSCD (64.07%), MLCD (76.89%), CDD (97.01%), S2Looking (52.25%) and LEVIR-CD (85.62%), with notably precise boundary delineation and strong suppression of pseudo-changes. In summary, PeftCD presents an optimal balance of accuracy, efficiency, and generalization. It offers a powerful and scalable paradigm for adapting large-scale VFMs to real-world remote sensing change detection applications. The code and pretrained models will be released at https://github.com/dyzy41/PeftCD.

CVJun 16, 2024Code
PyramidMamba: Rethinking Pyramid Feature Fusion with Selective Space State Model for Semantic Segmentation of Remote Sensing Imagery

Libo Wang, Dongxu Li, Sijun Dong et al.

Semantic segmentation, as a basic tool for intelligent interpretation of remote sensing images, plays a vital role in many Earth Observation (EO) applications. Nowadays, accurate semantic segmentation of remote sensing images remains a challenge due to the complex spatial-temporal scenes and multi-scale geo-objects. Driven by the wave of deep learning (DL), CNN- and Transformer-based semantic segmentation methods have been explored widely, and these two architectures both revealed the importance of multi-scale feature representation for strengthening semantic information of geo-objects. However, the actual multi-scale feature fusion often comes with the semantic redundancy issue due to homogeneous semantic contents in pyramid features. To handle this issue, we propose a novel Mamba-based segmentation network, namely PyramidMamba. Specifically, we design a plug-and-play decoder, which develops a dense spatial pyramid pooling (DSPP) to encode rich multi-scale semantic features and a pyramid fusion Mamba (PFM) to reduce semantic redundancy in multi-scale feature fusion. Comprehensive ablation experiments illustrate the effectiveness and superiority of the proposed method in enhancing multi-scale feature representation as well as the great potential for real-time semantic segmentation. Moreover, our PyramidMamba yields state-of-the-art performance on three publicly available datasets, i.e. the OpenEarthMap (70.8% mIoU), ISPRS Vaihingen (84.8% mIoU) and Potsdam (88.0% mIoU) datasets. The code will be available at https://github.com/WangLibo1995/GeoSeg.

CVJan 5, 2025Code
Facial Attractiveness Prediction in Live Streaming: A New Benchmark and Multi-modal Method

Hui Li, Xiaoyu Ren, Hongjiu Yu et al.

Facial attractiveness prediction (FAP) has long been an important computer vision task, which could be widely applied in live streaming for facial retouching, content recommendation, etc. However, previous FAP datasets are either small, closed-source, or lack diversity. Moreover, the corresponding FAP models exhibit limited generalization and adaptation ability. To overcome these limitations, in this paper we present LiveBeauty, the first large-scale live-specific FAP dataset, in a more challenging application scenario, i.e., live streaming. 10,000 face images are collected from a live streaming platform directly, with 200,000 corresponding attractiveness annotations obtained from a well-devised subjective experiment, making LiveBeauty the largest open-access FAP dataset in the challenging live scenario. Furthermore, a multi-modal FAP method is proposed to measure the facial attractiveness in live streaming. Specifically, we first extract holistic facial prior knowledge and multi-modal aesthetic semantic features via a Personalized Attractiveness Prior Module (PAPM) and a Multi-modal Attractiveness Encoder Module (MAEM), respectively, then integrate the extracted features through a Cross-Modal Fusion Module (CMFM). Extensive experiments conducted on both LiveBeauty and other open-source FAP datasets demonstrate that our proposed method achieves state-of-the-art performance. Dataset will be available soon.

CRNov 23, 2024Code
"Moralized" Multi-Step Jailbreak Prompts: Black-Box Testing of Guardrails in Large Language Models for Verbal Attacks

Libo Wang

As the application of large language models continues to expand in various fields, it poses higher challenges to the effectiveness of identifying harmful content generation and guardrail mechanisms. This research aims to evaluate the guardrail effectiveness of GPT-4o, Grok-2 Beta, Llama 3.1 (405B), Gemini 1.5, and Claude 3.5 Sonnet through black-box testing of seemingly ethical multi-step jailbreak prompts. It conducts ethical attacks by designing an identical multi-step prompts that simulates the scenario of "corporate middle managers competing for promotions." The data results show that the guardrails of the above-mentioned LLMs were bypassed and the content of verbal attacks was generated. Claude 3.5 Sonnet's resistance to multi-step jailbreak prompts is more obvious. To ensure objectivity, the experimental process, black box test code, and enhanced guardrail code are uploaded to the GitHub repository: https://github.com/brucewang123456789/GeniusTrail.git.

CVSep 18, 2021Code
UNetFormer: A UNet-like Transformer for Efficient Semantic Segmentation of Remote Sensing Urban Scene Imagery

Libo Wang, Rui Li, Ce Zhang et al.

Semantic segmentation of remotely sensed urban scene images is required in a wide range of practical applications, such as land cover mapping, urban change detection, environmental protection, and economic assessment.Driven by rapid developments in deep learning technologies, the convolutional neural network (CNN) has dominated semantic segmentation for many years. CNN adopts hierarchical feature representation, demonstrating strong capabilities for local information extraction. However, the local property of the convolution layer limits the network from capturing the global context. Recently, as a hot topic in the domain of computer vision, Transformer has demonstrated its great potential in global information modelling, boosting many vision-related tasks such as image classification, object detection, and particularly semantic segmentation. In this paper, we propose a Transformer-based decoder and construct a UNet-like Transformer (UNetFormer) for real-time urban scene segmentation. For efficient segmentation, the UNetFormer selects the lightweight ResNet18 as the encoder and develops an efficient global-local attention mechanism to model both global and local information in the decoder. Extensive experiments reveal that our method not only runs faster but also produces higher accuracy compared with state-of-the-art lightweight models. Specifically, the proposed UNetFormer achieved 67.8% and 52.4% mIoU on the UAVid and LoveDA datasets, respectively, while the inference speed can achieve up to 322.4 FPS with a 512x512 input on a single NVIDIA GTX 3090 GPU. In further exploration, the proposed Transformer-based decoder combined with a Swin Transformer encoder also achieves the state-of-the-art result (91.3% F1 and 84.1% mIoU) on the Vaihingen dataset. The source code will be freely available at https://github.com/WangLibo1995/GeoSeg.

CVJun 23, 2021Code
Transformer Meets Convolution: A Bilateral Awareness Network for Semantic Segmentation of Very Fine Resolution Urban Scene Images

Libo Wang, Rui Li, Dongzhi Wang et al.

Semantic segmentation from very fine resolution (VFR) urban scene images plays a significant role in several application scenarios including autonomous driving, land cover classification, and urban planning, etc. However, the tremendous details contained in the VFR image, especially the considerable variations in scale and appearance of objects, severely limit the potential of the existing deep learning approaches. Addressing such issues represents a promising research field in the remote sensing community, which paves the way for scene-level landscape pattern analysis and decision making. In this paper, we propose a Bilateral Awareness Network which contains a dependency path and a texture path to fully capture the long-range relationships and fine-grained details in VFR images. Specifically, the dependency path is conducted based on the ResT, a novel Transformer backbone with memory-efficient multi-head self-attention, while the texture path is built on the stacked convolution operation. Besides, using the linear attention mechanism, a feature aggregation module is designed to effectively fuse the dependency features and texture features. Extensive experiments conducted on the three large-scale urban scene image segmentation datasets, i.e., ISPRS Vaihingen dataset, ISPRS Potsdam dataset, and UAVid dataset, demonstrate the effectiveness of our BANet. Specifically, a 64.6% mIoU is achieved on the UAVid dataset. Code is available at https://github.com/WangLibo1995/GeoSeg.

CVApr 25, 2021Code
A Novel Transformer Based Semantic Segmentation Scheme for Fine-Resolution Remote Sensing Images

Libo Wang, Rui Li, Chenxi Duan et al.

The fully convolutional network (FCN) with an encoder-decoder architecture has been the standard paradigm for semantic segmentation. The encoder-decoder architecture utilizes an encoder to capture multilevel feature maps, which are incorporated into the final prediction by a decoder. As the context is crucial for precise segmentation, tremendous effort has been made to extract such information in an intelligent fashion, including employing dilated/atrous convolutions or inserting attention modules. However, these endeavors are all based on the FCN architecture with ResNet or other backbones, which cannot fully exploit the context from the theoretical concept. By contrast, we introduce the Swin Transformer as the backbone to extract the context information and design a novel decoder of densely connected feature aggregation module (DCFAM) to restore the resolution and produce the segmentation map. The experimental results on two remotely sensed semantic segmentation datasets demonstrate the effectiveness of the proposed scheme.Code is available at https://github.com/WangLibo1995/GeoSeg

CVFeb 16, 2021Code
A2-FPN for Semantic Segmentation of Fine-Resolution Remotely Sensed Images

Rui Li, Shunyi Zheng, Ce Zhang et al.

Semantic segmentation using fine-resolution remotely sensed images plays a critical role in many practical applications, such as urban planning, environmental protection, natural and anthropogenic landscape monitoring, etc. However, the automation of semantic segmentation, i.e., automatic categorization/labeling and segmentation is still a challenging task, particularly for fine-resolution images with huge spatial and spectral complexity. Addressing such a problem represents an exciting research field, which paves the way for scene-level landscape pattern analysis and decision making. In this paper, we propose an approach for automatic land segmentation based on the Feature Pyramid Network (FPN). As a classic architecture, FPN can build a feature pyramid with high-level semantics throughout. However, intrinsic defects in feature extraction and fusion hinder FPN from further aggregating more discriminative features. Hence, we propose an Attention Aggregation Module (AAM) to enhance multi-scale feature learning through attention-guided feature aggregation. Based on FPN and AAM, a novel framework named Attention Aggregation Feature Pyramid Network (A2-FPN) is developed for semantic segmentation of fine-resolution remotely sensed images. Extensive experiments conducted on three datasets demonstrate the effectiveness of our A2 -FPN in segmentation accuracy. Code is available at https://github.com/lironui/A2-FPN.

CVDec 20, 2023
MetaSegNet: Metadata-collaborative Vision-Language Representation Learning for Semantic Segmentation of Remote Sensing Images

Libo Wang, Sijun Dong, Ying Chen et al.

Semantic segmentation of remote sensing images plays a vital role in a wide range of Earth Observation applications, such as land use land cover mapping, environment monitoring, and sustainable development. Driven by rapid developments in artificial intelligence, deep learning (DL) has emerged as the mainstream for semantic segmentation and has achieved many breakthroughs in the field of remote sensing. However, most DL-based methods focus on unimodal visual data while ignoring rich multimodal information involved in the real world. Non-visual data, such as text, can gather extra knowledge from the real world, which can strengthen the interpretability, reliability, and generalization of visual models. Inspired by this, we propose a novel metadata-collaborative segmentation network (MetaSegNet) that applies vision-language representation learning for semantic segmentation of remote sensing images. Unlike the common model structure that only uses unimodal visual data, we extract the key characteristic (e.g. the climate zone) from freely available remote sensing image metadata and transfer it into geographic text prompts via the generic ChatGPT. Then, we construct an image encoder, a text encoder, and a crossmodal attention fusion subnetwork to extract the image and text feature and apply image-text interaction. Benefiting from such a design, the proposed MetaSegNet not only demonstrates superior generalization in zero-shot testing but also achieves competitive accuracy with the state-of-the-art semantic segmentation methods on the large-scale OpenEarthMap dataset (70.4% mIoU) and the Potsdam dataset (93.3% mean F1 score) as well as the LoveDA dataset (52.0% mIoU).

AIFeb 7, 2025
Dynamic Chain-of-Thought: Towards Adaptive Deep Reasoning

Libo Wang

To reduce the cost and consumption of computing resources caused by computational redundancy and delayed reward assignment in long CoT, this research proposes the dynamic chain-of-thought (D-CoT) with adaptive reasoning time and steps. The researcher used simulation experiment to simulate the integration of D-CoT through Python 3.13 IDLE combined with a Python simulator based on GPTs. At the same time, the researcher used DeepSeek R1 as a control group to test and compare the performance of the D-CoT simulator in processing MIT OpenCourseWare's linear algebra exam questions. Experimental results show that D-CoT is better than DeepSeek R1 based on long CoT in three indicators: reasoning time, CoT length (reasoning steps) and token count, which achieves a significant reduction in computing resource consumption. In addition, this research has potential value in deep reasoning optimization that is used as a reference for future dynamic deep reasoning frameworks.

RODec 29, 2024
Multi-Scenario Reasoning: Unlocking Cognitive Autonomy in Humanoid Robots for Multimodal Understanding

Libo Wang

To improve the cognitive autonomy of humanoid robots, this research proposes a multi-scenario reasoning architecture to solve the technical shortcomings of multi-modal understanding in this field. It draws on simulation based experimental design that adopts multi-modal synthesis (visual, auditory, tactile) and builds a simulator "Maha" to perform the experiment. The findings demonstrate the feasibility of this architecture in multimodal data. It provides reference experience for the exploration of cross-modal interaction strategies for humanoid robots in dynamic environments. In addition, multi-scenario reasoning simulates the high-level reasoning mechanism of the human brain to humanoid robots at the cognitive level. This new concept promotes cross-scenario practical task transfer and semantic-driven action planning. It heralds the future development of self-learning and autonomous behavior of humanoid robots in changing scenarios.

AINov 15, 2024
Mitigating Sycophancy in Decoder-Only Transformer Architectures: Synthetic Data Intervention

Libo Wang

To address the sycophancy problem caused by reinforcement learning from human feedback in large language models, this research applies synthetic data intervention technology to the decoder-only transformer architecture. Based on the research gaps in the existing literature, the researcher designed an experimental process to reduce the tendency of models to cater by generating diversified data, and used GPT4o as an experimental tool for verification. The experiment used 100 true and false questions, and compared the performance of the model trained with synthetic data intervention and the original untrained model on multiple indicators. The results show that the SDI training model supports the technology in terms of accuracy rate and sycophancy rate and has significant effectiveness in reducing sycophancy phenomena.

LGJun 9, 2025
Graph-of-Causal Evolution: Challenging Chain-of-Model for Reasoning

Libo Wang

In view of the problem that each subchain in the chain-of-model (CoM) relies only on the information of the previous subchain and may lose long-range dependencies due to the causal mask blocking the global context flow between multi-level subchains, this work proposes a graph of causal evolution (GoCE). Its core principle is to map the implicit token representation into a differentiable and sparse causal adjacency matrix, then permeate causal constraints through each layer of calculation using causal-masked attention and causal-MoE. By combining intervention consistency loss test and self-evolution gate, the dynamic balance between causal structure learning and adaptive updating of transformer architecture is realized. The researcher built experimental environments in sandboxes built with Claude Sonnet 4, o4-mini-high, and DeepSeek R1 respectively with the transformer variant architecture introduced in GoCE. It is evaluated on publicly available datasets including CLUTRR, CLADDER, EX-FEVER, and CausalQA and compared with the baseline LLMs. The finding proves that GoCE strengthens the transformer's ability to capture long-range causal dependencies, while the ability to self-evolve is improved. It not only surpasses the design of CoM in terms of design principles, but also provides experience for future research on causal learning and continuous adaptive improvement.

ROMay 25, 2025
Towards Humanoid Robot Autonomy: A Dynamic Architecture Integrating Continuous thought Machines (CTM) and Model Context Protocol (MCP)

Libo Wang

To address the gaps between the static pre-set "thinking-planning-action" of humanoid robots in unfamiliar scenarios and the highly programmed "call tool-return result" due to the lack of autonomous coding capabilities, this work designs a dynamic architecture connecting continuous thought machines (CTM) and model context protocol (MCP). It proposes a theoretical parallel solution through tick-slab and uses rank compression to achieve parameter suppression to provide a solution for achieving autonomous actions due to autonomous coding. The researcher used a simulation-based experiment using OpenAI's o4-mini-high as a tool to build the experimental environment, and introduced the extended SayCan dataset to conduct nine epochs of experiments. The experimental results show that the CTM-MCP architecture is feasible and effective through the data results of seven metrics: task success rate (TSR), execution success rate (ESR), average episode length (AEL), ROSCOE, REVEAL, proficiency self-assessment (PSA), task effectiveness (TE). In practice, it provides a reference experience for exploring the autonomous dynamic coding of humanoid robots based on continuous thinking to achieve human-like autonomous actions.

LGJan 24, 2025
Wormhole Memory: A Rubik's Cube for Cross-Dialogue Retrieval

Libo Wang

In view of the gap in the current large language model in sharing memory across dialogues, this research proposes a wormhole memory module (WMM) to realize memory as a Rubik's cube that can be arbitrarily retrieved between different dialogues. Through simulation experiments, the researcher built an experimental framework based on the Python environment and used setting memory barriers to simulate the current situation where memories between LLMs dialogues are difficult to share. The CoQA development data set was imported into the experiment, and the feasibility of its cross-dialogue memory retrieval function was verified for WMM's nonlinear indexing and dynamic retrieval, and a comparative analysis was conducted with the capabilities of Titans and MemGPT memory modules. Experimental results show that WMM demonstrated the ability to retrieve memory across dialogues and the stability of quantitative indicators in eight experiments. It contributes new technical approaches to the optimization of memory management of LLMs and provides experience for the practical application in the future.

CVNov 29, 2021
Building extraction with vision transformer

Libo Wang, Shenghui Fang, Rui Li et al.

As an important carrier of human productive activities, the extraction of buildings is not only essential for urban dynamic monitoring but also necessary for suburban construction inspection. Nowadays, accurate building extraction from remote sensing images remains a challenge due to the complex background and diverse appearances of buildings. The convolutional neural network (CNN) based building extraction methods, although increased the accuracy significantly, are criticized for their inability for modelling global dependencies. Thus, this paper applies the Vision Transformer for building extraction. However, the actual utilization of the Vision Transformer often comes with two limitations. First, the Vision Transformer requires more GPU memory and computational costs compared to CNNs. This limitation is further magnified when encountering large-sized inputs like fine-resolution remote sensing images. Second, spatial details are not sufficiently preserved during the feature extraction of the Vision Transformer, resulting in the inability for fine-grained building segmentation. To handle these issues, we propose a novel Vision Transformer (BuildFormer), with a dual-path structure. Specifically, we design a spatial-detailed context path to encode rich spatial details and a global context path to capture global dependencies. Besides, we develop a window-based linear multi-head self-attention to make the complexity of the multi-head self-attention linear with the window size, which strengthens the global context extraction by using large windows and greatly improves the potential of the Vision Transformer in processing large-sized remote sensing images. The proposed method yields state-of-the-art performance (75.74% IoU) on the Massachusetts building dataset. Code will be available.

CVMar 14, 2021
Scale-aware Neural Network for Semantic Segmentation of Multi-resolution Remote Sensing Images

Libo Wang, Ce Zhang, Rui Li et al.

Assigning geospatial objects with specific categories at the pixel level is a fundamental task in remote sensing image analysis. Along with rapid development in sensor technologies, remotely sensed images can be captured at multiple spatial resolutions (MSR) with information content manifested at different scales. Extracting information from these MSR images represents huge opportunities for enhanced feature representation and characterisation. However, MSR images suffer from two critical issues: 1) increased scale variation of geo-objects and 2) loss of detailed information at coarse spatial resolutions. To bridge these gaps, in this paper, we propose a novel scale-aware neural network (SaNet) for semantic segmentation of MSR remotely sensed imagery. SaNet deploys a densely connected feature network (DCFFM) module to capture high-quality multi-scale context, such that the scale variation is handled properly and the quality of segmentation is increased for both large and small objects. A spatial feature recalibration (SFRM) module is further incorporated into the network to learn intact semantic content with enhanced spatial relationships, where the negative effects of information loss are removed. The combination of DCFFM and SFRM allows SaNet to learn scale-aware feature representation, which outperforms the existing multi-scale feature representation. Extensive experiments on three semantic segmentation datasets demonstrated the effectiveness of the proposed SaNet in cross-resolution segmentation.