LGJul 11, 2022
Deep Squared Euclidean Approximation to the Levenshtein Distance for DNA StorageAlan J. X. Guo, Cong Liang, Qing-Hu Hou
Storing information in DNA molecules is of great interest because of its advantages in longevity, high storage density, and low maintenance cost. A key step in the DNA storage pipeline is to efficiently cluster the retrieved DNA sequences according to their similarities. Levenshtein distance is the most suitable metric on the similarity between two DNA sequences, but it is inferior in terms of computational complexity and less compatible with mature clustering algorithms. In this work, we propose a novel deep squared Euclidean embedding for DNA sequences using Siamese neural network, squared Euclidean embedding, and chi-squared regression. The Levenshtein distance is approximated by the squared Euclidean distance between the embedding vectors, which is fast calculated and clustering algorithm friendly. The proposed approach is analyzed theoretically and experimentally. The results show that the proposed embedding is efficient and robust.
ITJul 10, 2024
Disturbance-based Discretization, Differentiable IDS Channel, and an IDS-Correcting Code for DNA-based StorageAlan J. X. Guo, Mengyi Wei, Yufan Dai et al.
With recent advancements in next-generation data storage, especially in biological molecule-based storage, insertion, deletion, and substitution (IDS) error-correcting codes have garnered increased attention. However, a universal method for designing tailored IDS-correcting codes across varying channel settings remains underexplored. We present an autoencoder-based approach, THEA-code, aimed at efficiently generating IDS-correcting codes for complex IDS channels. In the work, a disturbance-based discretization is proposed to discretize the features of the autoencoder, and a simulated differentiable IDS channel is developed as a differentiable alternative for IDS operations. These innovations facilitate the successful convergence of the autoencoder, producing channel-customized IDS-correcting codes that demonstrate commendable performance across complex IDS channels, particularly in realistic DNA-based storage channels.
LGDec 13, 2023
Levenshtein Distance Embedding with Poisson Regression for DNA StorageXiang Wei, Alan J. X. Guo, Sihan Sun et al.
Efficient computation or approximation of Levenshtein distance, a widely-used metric for evaluating sequence similarity, has attracted significant attention with the emergence of DNA storage and other biological applications. Sequence embedding, which maps Levenshtein distance to a conventional distance between embedding vectors, has emerged as a promising solution. In this paper, a novel neural network-based sequence embedding technique using Poisson regression is proposed. We first provide a theoretical analysis of the impact of embedding dimension on model performance and present a criterion for selecting an appropriate embedding dimension. Under this embedding dimension, the Poisson regression is introduced by assuming the Levenshtein distance between sequences of fixed length following a Poisson distribution, which naturally aligns with the definition of Levenshtein distance. Moreover, from the perspective of the distribution of embedding distances, Poisson regression approximates the negative log likelihood of the chi-squared distribution and offers advancements in removing the skewness. Through comprehensive experiments on real DNA storage data, we demonstrate the superior performance of the proposed method compared to state-of-the-art approaches.
ITDec 20, 2023
DoDo-Code: an Efficient Levenshtein Distance Embedding-based Code for 4-ary IDS ChannelAlan J. X. Guo, Sihan Sun, Xiang Wei et al.
With the emergence of new storage and communication methods, the insertion, deletion, and substitution (IDS) channel has attracted considerable attention. However, many topics on the IDS channel and the associated Levenshtein distance remain open, making the invention of a novel IDS-correcting code a hard task. Furthermore, current studies on single-IDS-correcting code misalign with the requirements of applications which necessitates the correcting of multiple errors. Compromise solutions have involved shortening codewords to reduce the chance of multiple errors. However, the code rates of existing codes are poor at short lengths, diminishing the overall storage density. In this study, a novel method is introduced for designing high-code-rate single-IDS-correcting codewords through deep Levenshtein distance embedding. A deep learning model is utilized to project the sequences into embedding vectors that preserve the Levenshtein distances between the original sequences. This embedding space serves as a proxy for the complex Levenshtein domain, within which algorithms for codeword search and segment correcting is developed. While the concept underpinning this approach is straightforward, it bypasses the mathematical challenges typically encountered in code design. The proposed method results in a code rate that outperforms existing combinatorial solutions, particularly for designing short-length codewords.
LGFeb 28, 2025
Efficient Transformer-based Decoder for Varshamov-Tenengolts CodesYali Wei, Alan J. X. Guo, Zihui Yan et al.
In recent years, the rise of DNA data storage technology has brought significant attention to the challenge of correcting insertion, deletion, and substitution (IDS) errors. Among various coding methods for IDS correction, Varshamov-Tenengolts (VT) codes, primarily designed for single-error correction, have emerged as a central research focus. While existing decoding methods achieve high accuracy in correcting a single error, they often fail to correct multiple IDS errors. In this work, we observe that VT codes retain some capability for addressing multiple errors by introducing a transformer-based VT decoder (TVTD) along with symbol- and statistic-based codeword embedding. Experimental results demonstrate that the proposed TVTD achieves perfect correction of a single error. Furthermore, when decoding multiple errors across various codeword lengths, the bit error rate and frame error rate are significantly improved compared to existing hard decision and soft-in soft-out algorithms. Additionally, through model architecture optimization, the proposed method reduces time consumption by an order of magnitude compared to other soft decoders.
LGApr 5, 2021
Improving the Expressive Power of Graph Neural Network with Tinhofer AlgorithmAlan J. X. Guo, Qing-Hu Hou, Ou Wu
In recent years, Graph Neural Network (GNN) has bloomly progressed for its power in processing graph-based data. Most GNNs follow a message passing scheme, and their expressive power is mathematically limited by the discriminative ability of the Weisfeiler-Lehman (WL) test. Following Tinhofer's research on compact graphs, we propose a variation of the message passing scheme, called the Weisfeiler-Lehman-Tinhofer GNN (WLT-GNN), that theoretically breaks through the limitation of the WL test. In addition, we conduct comparative experiments and ablation studies on several well-known datasets. The results show that the proposed methods have comparable performances and better expressive power on these datasets.
CVApr 1, 2020
Improving Deep Hyperspectral Image Classification Performance with Spectral UnmixingAlan J. X. Guo, Fei Zhu
Recent advances in neural networks have made great progress in the hyperspectral image (HSI) classification. However, the overfitting effect, which is mainly caused by complicated model structure and small training set, remains a major concern. Reducing the complexity of the neural networks could prevent overfitting to some extent, but also declines the networks' ability to express more abstract features. Enlarging the training set is also difficult, for the high expense of acquisition and manual labeling. In this paper, we propose an abundance-based multi-HSI classification method. Firstly, we convert every HSI from the spectral domain to the abundance domain by a dataset-specific autoencoder. Secondly, the abundance representations from multiple HSIs are collected to form an enlarged dataset. Lastly, we train an abundance-based classifier and employ the classifier to predict over all the involved HSI datasets. Different from the spectra that are usually highly mixed, the abundance features are more representative in reduced dimension with less noise. This benefits the proposed method to employ simple classifiers and enlarged training data, and to expect less overfitting issues. The effectiveness of the proposed method is verified by the ablation study and the comparative experiments.
CVMar 4, 2019
Hyperspectral Image Classification with Deep Metric Learning and Conditional Random FieldYi Liang, Xin Zhao, Alan J. X. Guo et al.
To improve the classification performance in the context of hyperspectral image processing, many works have been developed based on two common strategies, namely the spatial-spectral information integration and the utilization of neural networks. However, both strategies typically require more training data than the classical algorithms, aggregating the shortage of labeled samples. In this letter, we propose a novel framework that organically combines the spectrum-based deep metric learning model and the conditional random field algorithm. The deep metric learning model is supervised by the center loss to produce spectrum-based features that gather more tightly in Euclidean space within classes. The conditional random field with Gaussian edge potentials, which is firstly proposed for image segmentation tasks, is introduced to give the pixel-wise classification over the hyperspectral image by utilizing both the geographical distances between pixels and the Euclidean distances between the features produced by the deep metric learning model. The proposed framework is trained by spectral pixels at the deep metric learning stage and utilizes the half handcrafted spatial features at the conditional random field stage. This settlement alleviates the shortage of training data to some extent. Experiments on two real hyperspectral images demonstrate the advantages of the proposed method in terms of both classification accuracy and computation cost.
CVJan 31, 2018
A CNN-based Spatial Feature Fusion Algorithm for Hyperspectral Imagery ClassificationAlan J. X. Guo, Fei Zhu
The shortage of training samples remains one of the main obstacles in applying the artificial neural networks (ANN) to the hyperspectral images classification. To fuse the spatial and spectral information, pixel patches are often utilized to train a model, which may further aggregate this problem. In the existing works, an ANN model supervised by center-loss (ANNC) was introduced. Training merely with spectral information, the ANNC yields discriminative spectral features suitable for the subsequent classification tasks. In this paper, a CNN-based spatial feature fusion (CSFF) algorithm is proposed, which allows a smart fusion of the spatial information to the spectral features extracted by ANNC. As a critical part of CSFF, a CNN-based discriminant model is introduced to estimate whether two paring pixels belong to the same class. At the testing stage, by applying the discriminant model to the pixel-pairs generated by the test pixel and its neighbors, the local structure is estimated and represented as a customized convolutional kernel. The spectral-spatial feature is obtained by a convolutional operation between the estimated kernel and the corresponding spectral features within a neighborhood. At last, the label of the test pixel is predicted by classifying the resulting spectral-spatial feature. Without increasing the number of training samples or involving pixel patches at the training stage, the CSFF framework achieves the state-of-the-art by declining $20\%-50\%$ classification failures in experiments on three well-known hyperspectral images.
CVNov 20, 2017
Spectral-Spatial Feature Extraction and Classification by ANN Supervised with Center Loss in Hyperspectral ImageryAlan J. X. Guo, Fei Zhu
In this paper, we propose a spectral-spatial feature extraction and classification framework based on artificial neuron network (ANN) in the context of hyperspectral imagery. With limited labeled samples, only spectral information is exploited for training and spatial context is integrated posteriorly at the testing stage. Taking advantage of recent advances in face recognition, a joint supervision symbol that combines softmax loss and center loss is adopted to train the proposed network, by which intra-class features are gathered while inter-class variations are enlarged. Based on the learned architecture, the extracted spectrum-based features are classified by a center classifier. Moreover, to fuse the spectral and spatial information, an adaptive spectral-spatial center classifier is developed, where multiscale neighborhoods are considered simultaneously, and the final label is determined using an adaptive voting strategy. Finally, experimental results on three well-known datasets validate the effectiveness of the proposed methods compared with the state-of-the-art approaches.