CLOct 12, 2022Code
ERNIE-Layout: Layout Knowledge Enhanced Pre-training for Visually-rich Document UnderstandingQiming Peng, Yinxu Pan, Wenjin Wang et al.
Recent years have witnessed the rise and success of pre-training techniques in visually-rich document understanding. However, most existing methods lack the systematic mining and utilization of layout-centered knowledge, leading to sub-optimal performances. In this paper, we propose ERNIE-Layout, a novel document pre-training solution with layout knowledge enhancement in the whole workflow, to learn better representations that combine the features from text, layout, and image. Specifically, we first rearrange input sequences in the serialization stage, and then present a correlative pre-training task, reading order prediction, to learn the proper reading order of documents. To improve the layout awareness of the model, we integrate a spatial-aware disentangled attention into the multi-modal transformer and a replaced regions prediction task into the pre-training phase. Experimental results show that ERNIE-Layout achieves superior performance on various downstream tasks, setting new state-of-the-art on key information extraction, document image classification, and document question answering datasets. The code and models are publicly available at http://github.com/PaddlePaddle/PaddleNLP/tree/develop/model_zoo/ernie-layout.
CLJun 1, 2023
Layout and Task Aware Instruction Prompt for Zero-shot Document Image Question AnsweringWenjin Wang, Yunhao Li, Yixin Ou et al.
Layout-aware pre-trained models has achieved significant progress on document image question answering. They introduce extra learnable modules into existing language models to capture layout information within document images from text bounding box coordinates obtained by OCR tools. However, extra modules necessitate pre-training on extensive document images. This prevents these methods from directly utilizing off-the-shelf instruction-tuning language foundation models, which have recently shown promising potential in zero-shot learning. Instead, in this paper, we find that instruction-tuning language models like Claude and ChatGPT can understand layout by spaces and line breaks. Based on this observation, we propose the LAyout and Task aware Instruction Prompt (LATIN-Prompt), which consists of layout-aware document content and task-aware instruction. Specifically, the former uses appropriate spaces and line breaks to recover the layout information among text segments obtained by OCR tools, and the latter ensures that generated answers adhere to formatting requirements. Moreover, we propose the LAyout and Task aware Instruction Tuning (LATIN-Tuning) to improve the performance of small instruction-tuning models like Alpaca. Experimental results show that LATIN-Prompt enables zero-shot performance of Claude and ChatGPT to be comparable to the fine-tuning performance of SOTAs on document image question answering, and LATIN-Tuning enhances the zero-shot performance of Alpaca significantly. For example, LATIN-Prompt improves the performance of Claude and ChatGPT on DocVQA by 263% and 20% respectively. LATIN-Tuning improves the performance of Alpaca on DocVQA by 87.7%. Quantitative and qualitative analyses demonstrate the effectiveness of LATIN-Prompt and LATIN-Tuning. We provide the code in supplementary and will release it to facilitate future research.
CVSep 18, 2022
ERNIE-mmLayout: Multi-grained MultiModal Transformer for Document UnderstandingWenjin Wang, Zhengjie Huang, Bin Luo et al.
Recent efforts of multimodal Transformers have improved Visually Rich Document Understanding (VrDU) tasks via incorporating visual and textual information. However, existing approaches mainly focus on fine-grained elements such as words and document image patches, making it hard for them to learn from coarse-grained elements, including natural lexical units like phrases and salient visual regions like prominent image regions. In this paper, we attach more importance to coarse-grained elements containing high-density information and consistent semantics, which are valuable for document understanding. At first, a document graph is proposed to model complex relationships among multi-grained multimodal elements, in which salient visual regions are detected by a cluster-based method. Then, a multi-grained multimodal Transformer called mmLayout is proposed to incorporate coarse-grained information into existing pre-trained fine-grained multimodal Transformers based on the graph. In mmLayout, coarse-grained information is aggregated from fine-grained, and then, after further processing, is fused back into fine-grained for final prediction. Furthermore, common sense enhancement is introduced to exploit the semantic information of natural lexical units. Experimental results on four tasks, including information extraction and document question answering, show that our method can improve the performance of multimodal Transformers based on fine-grained elements and achieve better performance with fewer parameters. Qualitative analyses show that our method can capture consistent semantics in coarse-grained elements.
CLJul 1, 2024
$\text{Memory}^3$: Language Modeling with Explicit MemoryHongkang Yang, Zehao Lin, Wenjin Wang et al.
The training and inference of large language models (LLMs) are together a costly process that transports knowledge from raw data to meaningful computation. Inspired by the memory hierarchy of the human brain, we reduce this cost by equipping LLMs with explicit memory, a memory format cheaper than model parameters and text retrieval-augmented generation (RAG). Conceptually, with most of its knowledge externalized to explicit memories, the LLM can enjoy a smaller parameter size, training cost, and inference cost, all proportional to the amount of remaining "abstract knowledge". As a preliminary proof of concept, we train from scratch a 2.4B LLM, which achieves better performance than much larger LLMs as well as RAG models, and maintains higher decoding speed than RAG. The model is named $\text{Memory}^3$, since explicit memory is the third form of memory in LLMs after implicit memory (model parameters) and working memory (context key-values). We introduce a memory circuitry theory to support the externalization of knowledge, and present novel techniques including a memory sparsification mechanism that makes storage tractable and a two-stage pretraining scheme that facilitates memory formation.
LGApr 11, 2023
Task Difficulty Aware Parameter Allocation & Regularization for Lifelong LearningWenjin Wang, Yunqing Hu, Qianglong Chen et al.
Parameter regularization or allocation methods are effective in overcoming catastrophic forgetting in lifelong learning. However, they solve all tasks in a sequence uniformly and ignore the differences in the learning difficulty of different tasks. So parameter regularization methods face significant forgetting when learning a new task very different from learned tasks, and parameter allocation methods face unnecessary parameter overhead when learning simple tasks. In this paper, we propose the Parameter Allocation & Regularization (PAR), which adaptively select an appropriate strategy for each task from parameter allocation and regularization based on its learning difficulty. A task is easy for a model that has learned tasks related to it and vice versa. We propose a divergence estimation method based on the Nearest-Prototype distance to measure the task relatedness using only features of the new task. Moreover, we propose a time-efficient relatedness-aware sampling-based architecture search strategy to reduce the parameter overhead for allocation. Experimental results on multiple benchmarks demonstrate that, compared with SOTAs, our method is scalable and significantly reduces the model's redundancy while improving the model's performance. Further qualitative analysis indicates that PAR obtains reasonable task-relatedness.
18.2CVMay 6
Optimize-at-Capture: Highly-adaptive Exposure Controlling for In-Vehicle Non-contact Heart-rate MonitoringJieying Wang, Xinqi Cai, Caifeng Shan et al.
Remote photoplethysmography (rPPG) holds great promise for continuous heart-rate monitoring of drivers in intelligent vehicles. However, its performance is severely degraded by the highly dynamic illumination changes. A critical yet overlooked factor is the lack of exposure controlling during video acquisition -- most existing systems rely on either fixed exposure settings or camera build-in auto-exposure, both of which fail to maintain stable facial brightness under rapidly changing lighting conditions during driving. To address this gap, we propose a highly-adaptive exposure controlling framework that proactively adjusts exposure parameters based on predictive modeling of historical skin reflections. Unlike standard auto-exposure, our method is specifically optimized for rPPG measurement, ensuring the skin region of interest (ROI) remains within the optimal dynamic range for rPPG signal extraction. As an important contribution of this study, we introduce ExpDrive, a public in-vehicle physiological monitoring dataset comprising synchronized facial video and reference ECG from 48 subjects captured under real driving conditions. Extensive experiments demonstrate that our method consistently outperforms fixed exposure and standard auto-exposure strategies. Specifically, it reduces the Mean Absolute Error (MAE) by 6.31 bpm (from 14.1 to 7.79 bpm) and significantly increases the success rate by 32.3 percentage points (p < 0.001) (from 24.9% to 57.2%) across challenging driving scenarios. Notably, it clearly improved the performance of non-contact heart-rate monitoring in both low-light (rainy) and high-glare (sunny) conditions, validating the efficacy of exposure-aware acquisition design.
SPFeb 13
Data-Driven Deep MIMO Detection:Network Architectures and Generalization AnalysisYongwei Yi, Xinping Yi, Wenjin Wang et al.
In practical Multiuser Multiple-Input Multiple-Output (MU-MIMO) systems, symbol detection remains challenging due to severe inter-user interference and sensitivity to Channel State Information (CSI) uncertainty. In contrast to the mostly studied belief propagation-type model-driven methods, which incur high computational complexity, Soft Interference Cancellation (SIC) strikes a good balance between performance and complexity. To further address CSI mismatch and nonlinear effects, the recently proposed data-driven deep neural receivers, such as DeepSIC, leverage the advantages of deep neural networks for interference cancellation and symbol detection, demonstrating strong empirical performance. However, there is still a lack of theoretical underpinning for why and to what extent DeepSIC could generalize with the number of training samples. This paper proposes inspecting the fully data-driven DeepSIC detection within a Network-of-MLPs architecture, which is composed of multiple interconnected MLPs via outer and inner Directed Acyclic Graphs (DAGs). Within such an architecture, DeepSIC can be upgraded as a graph-based message-passing process using Graph Neural Networks (GNNs), termed GNNSIC, with shared model parameters across users and iterations. Notably, GNNSIC achieves excellent expressivity comparable to DeepSIC with substantially fewer trainable parameters, resulting in improved sample efficiency and enhanced user generalization. By conducting a norm-based generalization analysis using Rademacher complexity, we reveal that an exponential dependence on the number of iterations for DeepSIC can be eliminated in GNNSIC due to parameter sharing. Simulation results demonstrate that GNNSIC attains comparable or improved Symbol Error Rate (SER) performance to DeepSIC with significantly fewer parameters and training samples.
CVSep 29, 2025Code
OIG-Bench: A Multi-Agent Annotated Benchmark for Multimodal One-Image Guides UnderstandingJiancong Xie, Wenjin Wang, Zhuomeng Zhang et al.
Recent advances in Multimodal Large Language Models (MLLMs) have demonstrated impressive capabilities. However, evaluating their capacity for human-like understanding in One-Image Guides remains insufficiently explored. One-Image Guides are a visual format combining text, imagery, and symbols to present reorganized and structured information for easier comprehension, which are specifically designed for human viewing and inherently embody the characteristics of human perception and understanding. Here, we present OIG-Bench, a comprehensive benchmark focused on One-Image Guide understanding across diverse domains. To reduce the cost of manual annotation, we developed a semi-automated annotation pipeline in which multiple intelligent agents collaborate to generate preliminary image descriptions, assisting humans in constructing image-text pairs. With OIG-Bench, we have conducted a comprehensive evaluation of 29 state-of-the-art MLLMs, including both proprietary and open-source models. The results show that Qwen2.5-VL-72B performs the best among the evaluated models, with an overall accuracy of 77%. Nevertheless, all models exhibit notable weaknesses in semantic understanding and logical reasoning, indicating that current MLLMs still struggle to accurately interpret complex visual-text relationships. In addition, we also demonstrate that the proposed multi-agent annotation system outperforms all MLLMs in image captioning, highlighting its potential as both a high-quality image description generator and a valuable tool for future dataset construction. Datasets are available at https://github.com/XiejcSYSU/OIG-Bench.
CHEM-PHJun 28, 2021Code
LiteGEM: Lite Geometry Enhanced Molecular Representation Learning for Quantum Property PredictionShanzhuo Zhang, Lihang Liu, Sheng Gao et al.
In this report, we (SuperHelix team) present our solution to KDD Cup 2021-PCQM4M-LSC, a large-scale quantum chemistry dataset on predicting HOMO-LUMO gap of molecules. Our solution, Lite Geometry Enhanced Molecular representation learning (LiteGEM) achieves a mean absolute error (MAE) of 0.1204 on the test set with the help of deep graph neural networks and various self-supervised learning tasks. The code of the framework can be found in https://github.com/PaddlePaddle/PaddleHelix/tree/dev/competition/kddcup2021-PCQM4M-LSC/.
CLJan 30, 2024
CRUD-RAG: A Comprehensive Chinese Benchmark for Retrieval-Augmented Generation of Large Language ModelsYuanjie Lyu, Zhiyu Li, Simin Niu et al.
Retrieval-Augmented Generation (RAG) is a technique that enhances the capabilities of large language models (LLMs) by incorporating external knowledge sources. This method addresses common LLM limitations, including outdated information and the tendency to produce inaccurate "hallucinated" content. However, the evaluation of RAG systems is challenging, as existing benchmarks are limited in scope and diversity. Most of the current benchmarks predominantly assess question-answering applications, overlooking the broader spectrum of situations where RAG could prove advantageous. Moreover, they only evaluate the performance of the LLM component of the RAG pipeline in the experiments, and neglect the influence of the retrieval component and the external knowledge database. To address these issues, this paper constructs a large-scale and more comprehensive benchmark, and evaluates all the components of RAG systems in various RAG application scenarios. Specifically, we have categorized the range of RAG applications into four distinct types-Create, Read, Update, and Delete (CRUD), each representing a unique use case. "Create" refers to scenarios requiring the generation of original, varied content. "Read" involves responding to intricate questions in knowledge-intensive situations. "Update" focuses on revising and rectifying inaccuracies or inconsistencies in pre-existing texts. "Delete" pertains to the task of summarizing extensive texts into more concise forms. For each of these CRUD categories, we have developed comprehensive datasets to evaluate the performance of RAG systems. We also analyze the effects of various components of the RAG system, such as the retriever, the context length, the knowledge base construction, and the LLM. Finally, we provide useful insights for optimizing the RAG technology for different scenarios.
SPDec 17, 2025
QoS-Aware Hierarchical Reinforcement Learning for Joint Link Selection and Trajectory Optimization in SAGIN-Supported UAV Mobility ManagementJiayang Wan, Ke He, Yafei Wang et al.
Due to the significant variations in unmanned aerial vehicle (UAV) altitude and horizontal mobility, it becomes difficult for any single network to ensure continuous and reliable threedimensional coverage. Towards that end, the space-air-ground integrated network (SAGIN) has emerged as an essential architecture for enabling ubiquitous UAV connectivity. To address the pronounced disparities in coverage and signal characteristics across heterogeneous networks, this paper formulates UAV mobility management in SAGIN as a constrained multi-objective joint optimization problem. The formulation couples discrete link selection with continuous trajectory optimization. Building on this, we propose a two-level multi-agent hierarchical deep reinforcement learning (HDRL) framework that decomposes the problem into two alternately solvable subproblems. To map complex link selection decisions into a compact discrete action space, we conceive a double deep Q-network (DDQN) algorithm in the top-level, which achieves stable and high-quality policy learning through double Q-value estimation. To handle the continuous trajectory action space while satisfying quality of service (QoS) constraints, we integrate the maximum-entropy mechanism of the soft actor-critic (SAC) and employ a Lagrangian-based constrained SAC (CSAC) algorithm in the lower-level that dynamically adjusts the Lagrange multipliers to balance constraint satisfaction and policy optimization. Moreover, the proposed algorithm can be extended to multi-UAV scenarios under the centralized training and decentralized execution (CTDE) paradigm, which enables more generalizable policies. Simulation results demonstrate that the proposed scheme substantially outperforms existing benchmarks in throughput, link switching frequency and QoS satisfaction.
CLJan 7, 2024
Grimoire is All You Need for Enhancing Large Language ModelsDing Chen, Shichao Song, Qingchen Yu et al.
In-context Learning (ICL) is one of the key methods for enhancing the performance of large language models on specific tasks by providing a set of few-shot examples. However, the ICL capability of different types of models shows significant variation due to factors such as model architecture, volume of learning data, and the size of parameters. Generally, the larger the model's parameter size and the more extensive the learning data, the stronger its ICL capability. In this paper, we propose a method SLEICL that involves learning from examples using strong language models and then summarizing and transferring these learned skills to weak language models for inference and application. This ensures the stability and effectiveness of ICL. Compared to directly enabling weak language models to learn from prompt examples, SLEICL reduces the difficulty of ICL for these models. Our experiments, conducted on up to eight datasets with five language models, demonstrate that weak language models achieve consistent improvement over their own zero-shot or few-shot capabilities using the SLEICL method. Some weak language models even surpass the performance of GPT4-1106-preview (zero-shot) with the aid of SLEICL.
SPFeb 19, 2025
Deep-Unfolded Massive Grant-Free Transmission in Cell-Free Wireless Communication SystemsGangle Sun, Mengyao Cao, Wenjin Wang et al.
Grant-free transmission and cell-free communication are vital in improving coverage and quality-of-service for massive machine-type communication. This paper proposes a novel framework of joint active user detection, channel estimation, and data detection (JACD) for massive grant-free transmission in cell-free wireless communication systems. We formulate JACD as an optimization problem and solve it approximately using forward-backward splitting. To deal with the discrete symbol constraint, we relax the discrete constellation to its convex hull and propose two approaches that promote solutions from the constellation set. To reduce complexity, we replace costly computations with approximate shrinkage operations and approximate posterior mean estimator computations. To improve active user detection (AUD) performance, we introduce a soft-output AUD module that considers both the data estimates and channel conditions. To jointly optimize all algorithm hyper-parameters and to improve JACD performance, we further deploy deep unfolding together with a momentum strategy, resulting in two algorithms called DU-ABC and DU-POEM. Finally, we demonstrate the efficacy of the proposed JACD algorithms via extensive system simulations.
SPOct 2, 2025
Unlocking Symbol-Level Precoding Efficiency Through Tensor Equivariant Neural NetworkJinshuo Zhang, Yafei Wang, Xinping Yi et al.
Although symbol-level precoding (SLP) based on constructive interference (CI) exploitation offers performance gains, its high complexity remains a bottleneck. This paper addresses this challenge with an end-to-end deep learning (DL) framework with low inference complexity that leverages the structure of the optimal SLP solution in the closed-form and its inherent tensor equivariance (TE), where TE denotes that a permutation of the input induces the corresponding permutation of the output. Building upon the computationally efficient model-based formulations, as well as their known closed-form solutions, we analyze their relationship with linear precoding (LP) and investigate the corresponding optimality condition. We then construct a mapping from the problem formulation to the solution and prove its TE, based on which the designed networks reveal a specific parameter-sharing pattern that delivers low computational complexity and strong generalization. Leveraging these, we propose the backbone of the framework with an attention-based TE module, achieving linear computational complexity. Furthermore, we demonstrate that such a framework is also applicable to imperfect CSI scenarios, where we design a TE-based network to map the CSI, statistics, and symbols to auxiliary variables. Simulation results show that the proposed framework captures substantial performance gains of optimal SLP, while achieving an approximately 80-times speedup over conventional methods and maintaining strong generalization across user numbers and symbol block lengths.
CYApr 21, 2024
An AI-Enabled Framework Within Reach for Enhancing Healthcare Sustainability and FairnessBin Huang, Changchen Zhao, Zimeng Liu et al.
Good health and well-being is among key issues in the United Nations 2030 Sustainable Development Goals. The rising prevalence of large-scale infectious diseases and the accelerated aging of the global population are driving the transformation of healthcare technologies. In this context, establishing large-scale public health datasets, developing medical models, and creating decision-making systems with a human-centric approach are of strategic significance. Recently, by leveraging the extraordinary number of accessible cameras, groundbreaking advancements have emerged in AI methods for physiological signal monitoring and disease diagnosis using camera sensors. These approaches, requiring no specialized medical equipment, offer convenient manners of collecting large-scale medical data in response to public health events. Therefore, we outline a prospective framework and heuristic vision for a camera-based public health (CBPH) framework utilizing visual physiological monitoring technology. The CBPH can be considered as a convenient and universal framework for public health, advancing the United Nations Sustainable Development Goals, particularly in promoting the universality, sustainability, and equity of healthcare in low- and middle-income countries or regions. Furthermore, CBPH provides a comprehensive solution for building a large-scale and human-centric medical database, and a multi-task large medical model for public health and medical scientific discoveries. It has a significant potential to revolutionize personal monitoring technologies, digital medicine, telemedicine, and primary health care in public health. Therefore, it can be deemed that the outcomes of this paper will contribute to the establishment of a sustainable and fair framework for public health, which serves as a crucial bridge for advancing scientific discoveries in the realm of AI for medicine (AI4Medicine).
CVMay 16, 2021
Algorithmic Principles of Camera-based Respiratory Motion ExtractionWenjin Wang, Albertus C. den Brinker
Measuring the respiratory signal from a video based on body motion has been proposed and recently matured in products for video health monitoring. The core algorithm for this measurement is the estimation of tiny chest/abdominal motions induced by respiration, and the fundamental challenge is motion sensitivity. Though prior arts reported on the validation with real human subjects, there is no thorough/rigorous benchmark to quantify the sensitivities and boundary conditions of motion-based core respiratory algorithms that measure sub-pixel displacement between video frames. In this paper, we designed a setup with a fully-controllable physical phantom to investigate the essence of core algorithms, together with a mathematical model incorporating two motion estimation strategies and three spatial representations, leading to six algorithmic combinations for respiratory signal extraction. Their promises and limitations are discussed and clarified via the phantom benchmark. The insights gained in this paper are intended to improve the understanding and applications of camera-based respiration measurement in health monitoring.
LGSep 8, 2020
Masked Label Prediction: Unified Message Passing Model for Semi-Supervised ClassificationYunsheng Shi, Zhengjie Huang, Shikun Feng et al.
Graph neural network (GNN) and label propagation algorithm (LPA) are both message passing algorithms, which have achieved superior performance in semi-supervised classification. GNN performs feature propagation by a neural network to make predictions, while LPA uses label propagation across graph adjacency matrix to get results. However, there is still no effective way to directly combine these two kinds of algorithms. To address this issue, we propose a novel Unified Message Passaging Model (UniMP) that can incorporate feature and label propagation at both training and inference time. First, UniMP adopts a Graph Transformer network, taking feature embedding and label embedding as input information for propagation. Second, to train the network without overfitting in self-loop input label information, UniMP introduces a masked label prediction strategy, in which some percentage of input label information are masked at random, and then predicted. UniMP conceptually unifies feature propagation and label propagation and is empirically powerful. It obtains new state-of-the-art semi-supervised classification results in Open Graph Benchmark (OGB).
LGMar 19, 2020
Lifelong Learning with Searchable Extension UnitsWenjin Wang, Yunqing Hu, Yin Zhang
Lifelong learning remains an open problem. One of its main difficulties is catastrophic forgetting. Many dynamic expansion approaches have been proposed to address this problem, but they all use homogeneous models of predefined structure for all tasks. The common original model and expansion structures ignore the requirement of different model structures on different tasks, which leads to a less compact model for multiple tasks and causes the model size to increase rapidly as the number of tasks increases. Moreover, they can not perform best on all tasks. To solve those problems, in this paper, we propose a new lifelong learning framework named Searchable Extension Units (SEU) by introducing Neural Architecture Search into lifelong learning, which breaks down the need for a predefined original model and searches for specific extension units for different tasks, without compromising the performance of the model on different tasks. Our approach can obtain a much more compact model without catastrophic forgetting. The experimental results on the PMNIST, the split CIFAR10 dataset, the split CIFAR100 dataset, and the Mixture dataset empirically prove that our method can achieve higher accuracy with much smaller model, whose size is about 25-33 percentage of that of the state-of-the-art methods.
IVFeb 11, 2020
2.75D: Boosting learning by representing 3D Medical imaging to 2D features for small dataXin Wang, Ruisheng Su, Weiyi Xie et al.
In medical-data driven learning, 3D convolutional neural networks (CNNs) have started to show superior performance to 2D CNNs in numerous deep learning tasks, proving the added value of 3D spatial information in feature representation. However, the difficulty in collecting more training samples to converge, more computational resources and longer execution time make this approach less applied. Also, applying transfer learning on 3D CNN is challenging due to a lack of publicly available pre-trained 3D models. To tackle these issues, we proposed a novel 2D strategical representation of volumetric data, namely 2.75D. In this work, the spatial information of 3D images is captured in a single 2D view by a spiral-spinning technique. As a result, 2D CNN networks can also be used to learn volumetric information. Besides, we can fully leverage pre-trained 2D CNNs for downstream vision problems. We also explore a multi-view 2.75D strategy, 2.75D 3 channels (2.75Dx3), to boost the advantage of 2.75D. We evaluated the proposed methods on three public datasets with different modalities or organs (Lung CT, Breast MRI, and Prostate MRI), against their 2D, 2.5D, and 3D counterparts in classification tasks. Results show that the proposed methods significantly outperform other counterparts when all methods were trained from scratch on the lung dataset. Such performance gain is more pronounced with transfer learning or in the case of limited training data. Our methods also achieved comparable performance on other datasets. In addition, our methods achieved a substantial reduction in time consumption of training and inference compared with the 2.5D or 3D method.
CVNov 7, 2019
Analysis of CNN-based remote-PPG to understand limitations and sensitivitiesQi Zhan, Wenjin Wang, Gerard de Haan
Deep learning based on Convolutional Neural Network (CNN) has shown promising results in various vision-based applications, recently also in camera-based vital signs monitoring. The CNN-based Photoplethysmography (PPG) extraction has, so far, been focused on performance rather than understanding. In this paper, we try to answer four questions with experiments aiming at improving our understanding of this methodology as it gains popularity. We conclude that the network exploits the blood absorption variation to extract the physiological signals, and that the choice and parameters (phase, spectral content, etc.) of the reference-signal may be more critical than anticipated. The availability of multiple convolutional kernels is necessary for CNN to arrive at a flexible channel combination through the spatial operation, but may not provide the same motion-robustness as a multi-site measurement using knowledge-based PPG extraction. Finally, we conclude that the PPG-related prior knowledge is still helpful for the CNN-based PPG extraction. Consequently, we recommend further investigation of hybrid CNN-based methods to include prior knowledge in their design.
SPOct 27, 2019
Learning to Localize: A 3D CNN Approach to User Positioning in Massive MIMO-OFDM SystemsChi Wu, Xinping Yi, Wenjin Wang et al.
In this paper, we consider the user positioning problem in the massive multiple-input multiple-output (MIMO) orthogonal frequency-division multiplexing (OFDM) system with a uniform planner antenna (UPA) array. Taking advantage of the UPA array geometry and wide bandwidth, we advocate the use of the angle-delay channel power matrix (ADCPM) as a new type of fingerprint to replace the traditional ones. The ADCPM embeds the stable and stationary multipath characteristics, e.g. delay, power, and angle in the vertical and horizontal directions, which are beneficial to positioning. Taking ADCPM fingerprints as the inputs, we propose a novel three-dimensional (3D) convolution neural network (CNN) enabled learning method to localize users' 3D positions. In particular, such a 3D CNN model consists of a convolution refinement module to refine the elementary feature maps from the ADCPM fingerprints, three extended Inception modules to extract the advanced feature maps, and a regression module to estimate the 3D positions. By intensive simulations, the proposed 3D CNN-enabled positioning method is demonstrated to achieve higher positioning accuracy than the traditional searching-based ones, with reduced computational complexity and storage overhead, and the ADCPM fingerprints are more robust to noise contamination.