LG · CG · Feb 26, 2015

Efficient Geometric-based Computation of the String Subsequence Kernel

arXiv:1502.07776v1 · 2 citations
Originality: Incremental advance
AI Analysis

This work addresses a computational bottleneck in kernel methods for machine learning, particularly for string processing tasks, though it is incremental as it builds on existing kernel and data structure concepts.

The paper tackles the computational inefficiency of the string subsequence kernel by introducing a geometric-based approach that reduces the computation to a range query problem. It achieves $O(p|L|\log|L|)$ time and $O(|L|\log|L|)$ space, and empirical evaluations show that it outperforms existing methods for large alphabets and long strings.

Kernel methods are powerful tools in machine learning, but they must be computationally efficient to be practical. In this paper, we present a novel geometric-based approach to efficiently compute the string subsequence kernel (SSK). Our main idea is that the SSK computation reduces to a range query problem. We start by constructing a match list $L(s,t)=\{(i,j):s_{i}=t_{j}\}$, where $s$ and $t$ are the strings to be compared; this match list contains only the data that contribute to the result. To compute the SSK efficiently, we extend the layered range tree data structure to a layered range sum tree, a range-aggregation data structure. The whole process takes $O(p|L|\log|L|)$ time and $O(|L|\log|L|)$ space, where $|L|$ is the size of the match list and $p$ is the subsequence length of the SSK. We present empirical evaluations of our approach against the dynamic programming and the sparse dynamic programming approaches, both on synthetically generated data and on newswire article data. These experiments show that our approach is efficient for large alphabet sizes, except for very short strings. Moreover, the proposed approach clearly outperforms the sparse dynamic programming approach for long strings.
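The reduction described in the abstract can be illustrated with a minimal Python sketch. It builds the match list $L(s,t)$ and then computes the SSK by summing, for every match $(i,j)$, the contributions of the matches $(i',j')$ with $i'<i$ and $j'<j$, which is a two-dimensional dominance range sum. This is not the authors' implementation: the range sum is done here by a brute-force scan over the match list, costing $O(p|L|^2)$, where the paper uses a layered range sum tree. The function names `match_list` and `ssk_from_matches`, the decay factor `lam`, and the gap-weighted SSK definition (each common subsequence weighted by `lam` raised to the sum of its spans in $s$ and $t$) are assumptions not stated in the abstract.

```python
def match_list(s, t):
    """Return L(s, t) = {(i, j) : s_i = t_j}, using 1-based positions."""
    return [(i, j)
            for i, a in enumerate(s, start=1)
            for j, b in enumerate(t, start=1)
            if a == b]


def ssk_from_matches(s, t, p, lam=0.5):
    """Illustrative SSK of subsequence length p >= 1 over the match list.

    Assumes the gap-weighted definition: every pair of occurrences of a
    common subsequence contributes lam ** (span in s + span in t).
    """
    matches = match_list(s, t)
    # d[(i, j)]: total weight of common subsequences of the current length
    # whose last symbol is matched exactly at s_i and t_j.
    d = {m: lam ** 2 for m in matches}  # length 1: span 1 in each string
    for _ in range(p - 1):
        nxt = {}
        for (i, j) in matches:
            total = 0.0
            # Brute-force dominance range sum over matches strictly
            # below and to the left of (i, j). The paper answers this
            # query with a layered range sum tree instead.
            for (a, b) in matches:
                if a < i and b < j:
                    total += d[(a, b)] * lam ** ((i - a) + (j - b))
            nxt[(i, j)] = total
        d = nxt
    return sum(d.values())


if __name__ == "__main__":
    print(match_list("cat", "car"))           # [(1, 1), (2, 2)]
    print(ssk_from_matches("cat", "car", 2))  # 0.0625, i.e. lam ** 4
```

Only positions in the match list are ever visited, which is why the cost depends on $|L|$ rather than on $|s||t|$; replacing the inner scan with a range-aggregation structure is what would bring the per-level cost down to $O(|L|\log|L|)$.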

