Eitan Yaakobi

IT
11 papers
111 citations
Novelty 36%
AI Score 50

11 Papers

89.1 IT Jun 3
Sequence Reconstruction for Substitution Channel: New Sufficient Conditions and Algorithms

Chen Wang, Eitan Yaakobi, Yiwei Zhang

In the sequence reconstruction problem, a codeword $\mathbf{x}$ is transmitted through several identical channels, each of which produces a noisy read of $\mathbf{x}$, and the problem is to analyze how to uniquely reconstruct $\mathbf{x}$ from these noisy reads. Levenshtein studied the minimum number of reads that guarantees unique reconstruction of $\mathbf{x}$, which is one sufficient condition for unique reconstruction. In this paper, we take a different perspective and propose a new framework for unique reconstruction. Our new sufficient condition for unique reconstruction takes both the number of reads and the distances among the reads into consideration. We offer both theoretical analysis and corresponding efficient reconstruction algorithms for our reconstruction framework.

86.6 IT May 31
Upper Bounds on Multiple $b$-Burst Deletion-Correcting Codes

Chen Wang, Xiangliang Kong, Eitan Yaakobi et al.

Motivated by their applications in DNA-based storage systems, codes capable of correcting consecutive deletions have attracted significant attention. An important class of such codes consists of those that can correct multiple consecutive deletion errors, commonly referred to as multiple $b$-burst deletion-correcting codes. In this paper, we investigate the fundamental limits of multiple $b$-burst deletion-correcting codes. Specifically, we first characterize several structural properties of the associated deletion balls. Then, leveraging these properties, we derive several upper bounds and a combinatorial lower bound on the maximum size of such codes. As a consequence, our bounds improve upon the previously known results for general parameter regimes and are shown to be asymptotically optimal for certain cases.

79.1 IT May 31
Rank Modulated Composite Encoding for Data Storage in DNA

Tomer Cohen, Zhiying Wang, Eitan Yaakobi et al.

This paper studies two problems that are motivated by combining two novel approaches, namely composite DNA and rank modulation. The recent approach of composite DNA takes advantage of the DNA synthesis property which generates a huge number of copies for every synthesized strand. Under this paradigm, a composite symbol does not store a single nucleotide but a mixture of the four DNA nucleotides. Instead of considering all the possible composite symbols, we are interested only in the rank of the motifs in the symbol. The first problem in this paper addresses the capacity of a channel that uses such symbols, while in the second, bounds and constructions of such codes are studied.

IT Jan 24, 2022
Insertion and Deletion Correction in Polymer-based Data Storage

Anisha Banerjee, Antonia Wachter-Zeh, Eitan Yaakobi

Synthetic polymer-based storage seems to be a particularly promising candidate that could help to cope with the ever-increasing demand for archival storage. It involves designing molecules of distinct masses to represent the respective bits $\{0,1\}$, followed by the synthesis of a polymer of molecular units that reflects the order of bits in the information string. Reading out the stored data requires the use of a tandem mass spectrometer, which fragments the polymer into shorter substrings and provides their corresponding masses, from which the \emph{composition}, i.e., the number of $1$s and $0$s in the concerned substring, can be inferred. Prior works have dealt with the problem of unique string reconstruction from the set of all possible compositions, called the \emph{composition multiset}. This was accomplished either by determining which string lengths always allow unique reconstruction, or by formulating coding constraints to facilitate the same for all string lengths. Additionally, error-correcting schemes to deal with substitution errors caused by imprecise fragmentation during the readout process have also been suggested. This work builds on this research by generalizing previously considered error models, mainly confined to substitution of compositions. To this end, we define new error models that consider insertions of spurious compositions and deletions of existing ones, thereby corrupting the composition multiset. We analyze whether the reconstruction codebook proposed by Pattabiraman \emph{et al.} is indeed robust to such errors, and if not, propose new coding constraints to remedy this.
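
As a rough illustration of the composition multiset described in this abstract, the following Python sketch enumerates every contiguous substring of a binary string and records its (number of 0s, number of 1s) pair. The function name `composition_multiset` and the pair ordering are assumptions made here for illustration; they are not taken from the paper.

```python
from collections import Counter


def composition_multiset(s: str) -> Counter:
    """Multiset of (zeros, ones) compositions over all contiguous substrings of s.

    Hypothetical helper: the name and the (zeros, ones) ordering are
    illustrative choices, not notation from the paper.
    """
    if any(ch not in "01" for ch in s):
        raise ValueError("expected a binary string")
    comps = Counter()
    n = len(s)
    for i in range(n):
        ones = 0
        for j in range(i, n):
            # Extend the substring s[i..j] by one symbol and update its count of 1s.
            ones += 1 if s[j] == "1" else 0
            length = j - i + 1
            comps[(length - ones, ones)] += 1
    return comps


if __name__ == "__main__":
    print(composition_multiset("011"))
```

A string and its reversal always produce the same multiset under this definition, which is one reason unique reconstruction is usually understood up to reversal.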

64.9 IT Apr 22
Serving Every Symbol: All-Symbol PIR and Batch Codes

Avital Boruchovsky, Anina Gruica, Jonathan Niemann et al.

A $t$-all-symbol PIR code and a $t$-all-symbol batch code of dimension $k$ consist of $n$ servers storing linear combinations of $k$ information symbols with the following recovery property: any symbol stored by a server can be recovered from $t$ pairwise disjoint subsets of servers. In the batch setting, we further require that any multiset of size $t$ of stored symbols can be recovered from~$t$ disjoint subsets of servers. This framework unifies and extends several well-known code families, including one-step majority-logic decodable codes, (functional) PIR codes, and (functional) batch codes. In this paper, we determine the minimum code length for some small values of $k$ and $t$, characterize structural properties of codes attaining this optimum, and derive bounds that show the trade-offs between length, dimension, minimum distance, and $t$. In addition, we study MDS codes and the simplex code, demonstrating how these classical families fit within our framework, and establish new cases of an open conjecture from \cite{YAAKOBI2020} concerning the minimal $t$ for which the simplex code is a $t$-functional batch code.

53.3 IT Apr 12
Error-Correcting Codes for the Sum Channel

Lyan Abboud, Eitan Yaakobi

We introduce the sum channel, a new channel model motivated by applications in distributed storage and DNA data storage. In the error-free case, it takes as input an $\ell$-row binary matrix and outputs an $(\ell+1)$-row matrix whose first $\ell$ rows equal the input and whose last row is their parity (sum) row. We construct a two-deletion-correcting code with redundancy $2\lceil\log_2\log_2 n\rceil + O(\ell^2)$ for $\ell$-row inputs. When $\ell=2$, we establish an upper bound of $\lceil\log_2\log_2 n\rceil + O(1)$, implying that our redundancy is optimal up to a factor of 2. We also present a code correcting a single substitution with $\lceil \log_2(\ell+1)\rceil$ redundant bits and prove that it is within one bit of optimality.
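
The error-free behavior of the sum channel, as described in this abstract, can be sketched in a few lines of Python. The function name `sum_channel` and the list-of-lists matrix representation are assumptions for illustration only; the sketch covers only the noiseless mapping, not the deletion or substitution codes.

```python
def sum_channel(rows):
    """Error-free sum channel: append the parity (XOR) row to an l-row binary matrix.

    Hypothetical helper; `rows` is a list of equal-length lists of 0/1 values.
    """
    if not rows or any(len(r) != len(rows[0]) for r in rows):
        raise ValueError("expected a non-empty matrix with equal-length rows")
    # Column-wise sum modulo 2 gives the parity row.
    parity = [sum(col) % 2 for col in zip(*rows)]
    return [list(r) for r in rows] + [parity]


if __name__ == "__main__":
    for row in sum_channel([[1, 0, 1], [1, 1, 0]]):
        print(row)
```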

53.4 IT May 18
Correcting Tail Deletions in Rank Modulated Composite Encoding for Data Storage in DNA

Tomer Cohen, Eitan Yaakobi, Zohar Yakhini

We study the combination of two recent coding approaches in the context of DNA-based data storage. Composite DNA alphabets leverage properties of the DNA synthesis and sequencing process. A composite symbol does not represent a single nucleotide, but rather a designed mixture of DNA nucleotides. Using the high multiplicity that is intrinsic to synthesis and sequencing, a composite symbol consists of frequencies in the mixture. Rank modulation codes use permutations to represent information. Combining the two, we construct an encoding that uses permutations of nucleotide frequencies rather than the exact frequency values. Codes for this approach were addressed in previous work, under Kendall's tau distance. In this work we study deletion and insertion codes. We present bounds and constructions of efficient codes defined over partial permutations.
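
To make the idea of using the ranking of nucleotide frequencies rather than their exact values concrete, here is a minimal Python sketch. The function name `rank_permutation`, the dictionary input, and the alphabetical tie-breaking rule are all assumptions introduced for illustration; the paper's actual encoding may differ.

```python
def rank_permutation(freqs):
    """Order nucleotides from most to least frequent in a composite symbol.

    Hypothetical helper: ties are broken alphabetically, which is an
    illustrative choice and not something stated in the abstract.
    """
    return tuple(sorted(freqs, key=lambda base: (-freqs[base], base)))


if __name__ == "__main__":
    # Two different mixtures with the same frequency ordering map to the same permutation.
    a = {"A": 0.40, "C": 0.10, "G": 0.30, "T": 0.20}
    b = {"A": 0.55, "C": 0.05, "G": 0.25, "T": 0.15}
    print(rank_permutation(a), rank_permutation(b))
```

Because only the ordering is stored, moderate noise in the observed frequencies leaves the symbol unchanged as long as the ordering is preserved.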

68.6 IT May 11
Random Access Expectation in DNA Storage and Fountain Codes

Christoph Hofmeister, Rawad Bitar, Eitan Yaakobi

Motivated by DNA data storage, we study the expected number of coded symbols drawn from a linear code until a desired information symbol can be decoded - the random access expectation. We focus on generator matrices with a type of symmetry, conjectured in prior work to be optimal, which we call fully symmetric. We point out an equivalence between binary fully symmetric codes and LT codes. Using this observation, we analyze the random access expectation of binary fully symmetric codes under a peeling decoder, in the large blocklength limit. Under these assumptions, the random access expectation, normalized by the number of information symbols, is at least $\pi/4 \approx 0.7854$, while a value of $\approx 0.7869$ is achievable.

IT Mar 6
The DNA Coverage Depth Problem: Duality, Weight Distributions, and Applications

Matteo Bertuzzo, Alberto Ravagnani, Eitan Yaakobi

The coverage depth problem in DNA data storage is about computing the expected number of reads needed to recover all encoded strands. Given a generator matrix of a linear code, this quantity equals the expected number of randomly drawn columns required to obtain full rank. While MDS codes are optimal when they exist, i.e., over large fields, practical scenarios may rely on structured code families defined over small fields. In this work, we develop combinatorial tools to solve the DNA coverage depth problem for various linear codes, based on duality arguments and the notion of extended weight enumerator. Using these methods, we derive closed formulas for the simplex, Hamming, ternary Golay, extended ternary Golay, and first-order Reed-Muller codes. The centerpiece of this paper is a general expression for the coverage depth of a linear code in terms of the weight distributions of its higher-field extensions.
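
For the MDS case mentioned in this abstract, the expected number of draws has a simple closed form, sketched below in Python. The sketch assumes columns are drawn uniformly at random with replacement (an assumption not spelled out in the abstract). Since any $k$ distinct columns of an $[n, k]$ MDS generator matrix are linearly independent, full rank is reached exactly when $k$ distinct columns have been seen, which gives a partial coupon-collector sum. The function name `mds_coverage_depth` is hypothetical.

```python
from fractions import Fraction


def mds_coverage_depth(n: int, k: int) -> Fraction:
    """Expected draws (uniform, with replacement) to see k distinct columns out of n.

    Hypothetical helper for an [n, k] MDS code: after i distinct columns
    have been seen, a new one appears with probability (n - i) / n, so the
    expected wait for it is n / (n - i).
    """
    if not 1 <= k <= n:
        raise ValueError("expected 1 <= k <= n")
    return sum(Fraction(n, n - i) for i in range(k))


if __name__ == "__main__":
    print(mds_coverage_depth(3, 2))
    print(float(mds_coverage_depth(10, 5)))
```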

IT Aug 31, 2021
Deep DNA Storage: Scalable and Robust DNA Storage via Coding Theory and Deep Learning

Daniella Bar-Lev, Itai Orr, Omer Sabary et al.

DNA-based storage is an emerging technology that enables digital information to be archived in DNA molecules. This method enjoys major advantages over magnetic and optical storage solutions, such as exceptional information density, enhanced data durability, and negligible power consumption to maintain data integrity. To access the data, an information retrieval process is employed, whose main bottlenecks include scalability and accuracy, with a natural tradeoff between the two. Here we show a modular and holistic approach that combines Deep Neural Networks (DNN) trained on simulated data, Tensor-Product (TP) based Error-Correcting Codes (ECC), and a safety margin mechanism into a single coherent pipeline. We demonstrated our solution on 3.1 MB of information using two different sequencing technologies. Our work improves upon the current leading solutions with up to a 3200x increase in speed and a 40% improvement in accuracy, and offers a code rate of 1.6 bits per base in a high-noise regime. In a broader sense, our work shows a viable path to commercial DNA storage solutions hindered by current information retrieval processes.

IT Dec 4, 2018
Private Information Retrieval in Graph Based Replication Systems

Netanel Raviv, Itzhak Tamo, Eitan Yaakobi

In a Private Information Retrieval (PIR) protocol, a user can download a file from a database without revealing the identity of the file to each individual server. A PIR protocol is called $t$-private if the identity of the file remains concealed even if $t$ of the servers collude. Graph based replication is a simple technique, which is prevalent in both theory and practice, for achieving erasure robustness in storage systems. In this technique each file is replicated on two or more storage servers, giving rise to a (hyper-)graph structure. In this paper we study private information retrieval protocols in graph based replication systems. The main interest of this work is maximizing the parameter $t$, and in particular, understanding the structure of the colluding sets which emerge in a given graph. Our main contribution is a $2$-replication scheme which guarantees perfect privacy from acyclic sets in the graph, and guarantees partial-privacy in the presence of cycles. Furthermore, by providing an upper bound, it is shown that the PIR rate of this scheme is at most a factor of two from its optimal value for an important family of graphs. Lastly, we extend our results to larger replication factors and to graph-based coding, which is a similar technique with smaller storage overhead and larger PIR rate.