Instruction
DNA Storage System
ENCODE
Encoding module
Polus exposes a common API around multiple DNA encoding/decoding schemes (“codecs”). As a demo, we first integrated three representative codecs: DNA Fountain, Yin–Yang (YYC), and Derrick. Each was re-implemented per its published specification and then wrapped in the common Polus API so that codecs can be switched seamlessly. All encoders start from an input file and produce a collection of DNA oligonucleotide sequences with addressing and redundancy.
DNA Fountain (outer LT code with inner RS). We followed Erlich & Zielinski’s architecture, in which each droplet carries a short inner Reed–Solomon (RS) check and the outer Luby-Transform (LT) fountain code resolves erasures at the fragment level. In the canonical configuration, each droplet comprises 4 bytes seed + 32 bytes payload + 2 bytes RS parity (total 38 bytes), and decoding admits only RS-validated droplets into the LT peeling solver. Logical redundancy is set by the droplet budget. In practice we target low single-digit overhead: our runs used ≈5–8%, in line with the ~7% overhead reported in the original work, obtained by adjusting the number of generated droplets (seed schedule) at fixed constraints (GC content, homopolymer length).
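The droplet layout above fixes the net information density before primers and constraint rejections are counted. The following is a minimal sketch of that arithmetic, assuming a direct 2 bits/nt mapping of droplet bytes to bases; the function name and its defaults are illustrative and not part of Polus.

```python
def fountain_net_density(seed_bytes=4, payload_bytes=32, rs_bytes=2,
                         bits_per_nt=2.0, overhead=0.07):
    """Return (nt per droplet, net payload bits per synthesized nt).

    Assumptions (not from the source): bytes map to bases at exactly
    `bits_per_nt`, and primers/adapters are not counted.
    """
    droplet_bytes = seed_bytes + payload_bytes + rs_bytes
    droplet_nt = droplet_bytes * 8 / bits_per_nt
    payload_bits = payload_bytes * 8
    # On average, (1 + overhead) droplets are synthesized per source fragment.
    density = payload_bits / (droplet_nt * (1.0 + overhead))
    return droplet_nt, density


droplet_nt, density = fountain_net_density()
print(f"{droplet_nt:.0f} nt per droplet, {density:.2f} bits/nt")
```

With the 38-byte droplet and 7% overhead, the seed and RS bytes together with the droplet surplus bring the density below the 2 bits/nt ceiling of the mapping.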
Yin–Yang codec. YYC uses two complementary mapping rules (“yin” and “yang”) to encode 2 bits per nucleotide while satisfying biochemical constraints (GC, homopolymers, secondary structure). We implemented YYC per Ping et al.’s description and appended an outer RS(255,223) over GF(2^8) to the emitted strands (32 parity symbols per 255-symbol codeword, ≈12.5%), leaving the YYC transcoding itself unchanged.
Derrick. Derrick introduces soft-decision decoding for DNA storage by predicting error locations/likelihoods and feeding them to RS decoders as erasures. We adopted the published pipeline (consensus, confidence estimation, and erasure-aware RS) and used RS(255,k) families matched to our comparison settings (default RS(255,223) unless otherwise noted) so that YYC and Derrick carry comparable outer-code overheads.
Constraint screening and outputs. After encoding, Polus applies the same constraint filters across codecs (GC window and homopolymer limit) before emitting FASTA/FASTQ. The platform then compiles an encoding report with: (i) logical density (bits nt⁻¹, counting all synthesized nt), (ii) encoding efficiency (throughput and wall-clock time), and (iii) sequence-property distributions (GC content and homopolymer run length). Because every codec passes through the same filters, differences observed downstream (e.g., coverage required for full recovery) can be traced to codec design.
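The screening step can be illustrated with a short sketch. It assumes whole-strand GC fraction rather than a sliding window, and the default thresholds (45–55% GC, runs of at most 5) are placeholders taken from the ranges mentioned in this document; the function names are hypothetical, not the Polus API.

```python
def gc_fraction(seq):
    """Fraction of G/C bases in a strand (0.0 for an empty strand)."""
    if not seq:
        return 0.0
    return sum(base in "GC" for base in seq) / len(seq)


def max_homopolymer(seq):
    """Length of the longest run of identical bases (0 for an empty strand)."""
    if not seq:
        return 0
    longest = run = 1
    for prev, cur in zip(seq, seq[1:]):
        run = run + 1 if cur == prev else 1
        longest = max(longest, run)
    return longest


def passes_constraints(seq, gc_min=0.45, gc_max=0.55, max_run=5):
    """True if the strand satisfies both filters (assumed thresholds)."""
    if not seq:
        return False
    return gc_min <= gc_fraction(seq) <= gc_max and max_homopolymer(seq) <= max_run


strands = ["ACGTACGTAC", "AAAAAACGCGCG", "ATATATATAT"]
kept = [s for s in strands if passes_constraints(s)]
```

A windowed GC check would apply `gc_fraction` to each fixed-length slice of the strand instead of to the strand as a whole.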
Implementation notes. (i) Our DNA Fountain implementation uses the published droplet format (4 B seed, 32 B payload, 2 B RS) and standard LT peeling; redundancy is controlled by droplet count. (ii) YYC transcoding follows the published dual-rule mapping; the outer RS layer is our evaluation choice to equalize FEC budgets across codecs. (iii) Derrick’s soft-decision RS follows the reference published in National Science Review (Ding et al.).
In silico DNA channel simulation
Here we implemented a multi-stage in silico simulation pipeline that mimics the DNA storage channel: from synthesis through storage/aging, PCR amplification, and sequencing. This simulation is critical for benchmarking because fully experimental tests are time-consuming and expensive. Polus’s simulator is modular, allowing different error models or parameters to be plugged in for each stage:
Synthesis errors: We modeled oligo synthesis using a stochastic simulator (dt4dds) that introduces context-dependent substitution, insertion, and deletion errors during oligo writing. The included parameters can be modified as needed.
Storage and decay: After synthesis, oligos may be stored for varying periods or conditions. Our simulator can optionally model DNA decay, including random strand breaks and base damage (e.g., C→T deamination). For accelerated aging tests, we increased these damage rates to mimic years or decades of storage.
PCR amplification: Prior to sequencing, DNA oligos are typically PCR amplified to create a sequencing library. PCR can introduce both bias, because certain sequences amplify more or less efficiently than others, and errors, because polymerases introduce substitutions (at rates of ~10⁻⁵–10⁻⁶ per base per cycle for high-fidelity enzymes) and, much more rarely, indels. We included a simplified PCR model in which each oligo is randomly assigned a coverage multiplier (some oligos end up overrepresented, others underrepresented in the sequencing input), with a dispersion factor reflecting empirically observed PCR bias.
Sequencing errors: Polus supports different sequencing models. For Illumina (short-read) sequencing, we used dt4dds in a sequencing mode to introduce predominantly substitution errors at a target rate (e.g., 0.1% per base) and small indel errors. For Oxford Nanopore (long-read) sequencing, we leveraged the read simulator Badread to generate long reads with typical Nanopore error characteristics (~5–10% errors, with an indel:substitution ratio ≈3:1).
The output of the simulator is a set of sequencing read files (FASTQ format with quality scores) for each simulated sequencing run.
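The simplified PCR bias model described above (one random coverage multiplier per oligo, controlled by a dispersion factor) might look like the sketch below. The log-normal distribution, the function names, and the default dispersion are assumptions made for illustration; the source does not state which distribution Polus uses.

```python
import random


def pcr_coverage_multipliers(n_oligos, dispersion=0.3, seed=0):
    """Draw one positive coverage multiplier per oligo.

    Assumption: multipliers are log-normal with sigma = `dispersion`;
    mu = -sigma^2 / 2 keeps the expected multiplier at 1.
    """
    rng = random.Random(seed)
    mu = -0.5 * dispersion ** 2
    return [rng.lognormvariate(mu, dispersion) for _ in range(n_oligos)]


def sample_read_counts(multipliers, mean_coverage, seed=0):
    """Distribute reads over oligos in proportion to their multipliers."""
    rng = random.Random(seed)
    n = len(multipliers)
    total_reads = round(mean_coverage * n)
    picks = rng.choices(range(n), weights=multipliers, k=total_reads)
    counts = [0] * n
    for i in picks:
        counts[i] += 1
    return counts


multipliers = pcr_coverage_multipliers(100)
counts = sample_read_counts(multipliers, mean_coverage=10)
```

Larger `dispersion` values spread the read counts further apart, so more oligos fall below the depth needed for a reliable consensus at the same mean coverage.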
Soft-decision decoding implementation
Polus’s soft-decision decoding module was implemented for each codec as follows. After SeqFormer produces a consensus sequence and an error probability for each base of an oligo, we translated this into inputs for the codec’s decoder:
RS codes (used in the Yin–Yang outer layer, in the DNA Fountain inner layer, and on their own in Derrick): We set a threshold on the per-base error probability (e.g., 0.3). Any base with an error probability above this threshold was treated as an erasure in the RS decoding step, and the RS decoder was modified to accept erasures. Decoding succeeds when 2 × errors + erasures ≤ n − k; for example, RS(255,223) has 32 parity symbols and can correct up to 16 errors, up to 32 erasures, or any mix within that budget. We tuned the threshold so that only truly ambiguous bases are marked, because every unnecessary erasure consumes part of the parity budget. In practice, we found a clear separation: SeqFormer’s confidence scores tend to be very high (> 0.99) for correct bases and low (< 0.5) for the few error positions, so a confidence cutoff of approximately 0.6 worked well (nearly all actual error positions were marked, with few false marks).
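The thresholding rule and the RS correction budget can be written down in a few lines. This is a sketch of the bookkeeping only, not of an RS decoder; the function names are illustrative, and the 0.3 default mirrors the example threshold above.

```python
def mark_erasures(error_probs, threshold=0.3):
    """Indices whose estimated error probability exceeds the threshold."""
    return [i for i, p in enumerate(error_probs) if p > threshold]


def rs_within_budget(n, k, n_errors, n_erasures):
    """Standard RS condition: an unknown-position error costs two parity
    symbols, a known-position erasure costs one."""
    return 2 * n_errors + n_erasures <= n - k


probs = [0.001, 0.002, 0.55, 0.004, 0.91]
erasures = mark_erasures(probs)

# RS(255,223): 32 parity symbols.
print(rs_within_budget(255, 223, n_errors=16, n_erasures=0))
print(rs_within_budget(255, 223, n_errors=10, n_erasures=13))
```

Correctly converting an error into an erasure halves its cost, which is where the soft-decision gain comes from; a false mark on a correct symbol spends one parity symbol for nothing.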
DNA Fountain. SeqFormer supplies position-wise base probabilities that we aggregate to byte-level reliabilities for each droplet. From these, we mark low-confidence byte positions as erasures and run an erasure-aware RS decoder on the droplet. Only droplets whose RS soft-decoding succeeds (i.e., passes syndrome check after correcting errors/erasures) are admitted to the outer Fountain decoder; droplets that fail RS are discarded rather than injected into LT equations. The LT decoder then performs iterative belief-propagation solving for source fragments. By filtering and rescuing droplets at the RS layer (via soft erasures) rather than hard-rejecting them, the Fountain stage receives more usable droplets at a given sequencing coverage, which increases the probability of completing the system of equations at the same logical overhead.
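Two steps of this path lend themselves to a small sketch: aggregating per-base error probabilities to bytes, and peeling RS-validated droplets in the outer LT stage. The sketch assumes 4 bases per byte, independent base errors, and droplets already expanded from their seeds into fragment-index sets; all names are hypothetical.

```python
import math


def byte_error_probs(base_error_probs, bases_per_byte=4):
    """P(byte wrong) = 1 - prod(1 - p_base), assuming independent bases."""
    out = []
    for start in range(0, len(base_error_probs), bases_per_byte):
        chunk = base_error_probs[start:start + bases_per_byte]
        out.append(1.0 - math.prod(1.0 - p for p in chunk))
    return out


def xor_bytes(a, b):
    return bytes(x ^ y for x, y in zip(a, b))


def lt_peel(droplets, n_fragments):
    """Peeling decoder over (fragment_index_set, payload) droplets.

    Returns a list of recovered fragments; entries stay None if the
    droplet set does not determine them.
    """
    pending = [[set(indices), bytes(payload)] for indices, payload in droplets]
    recovered = [None] * n_fragments
    progress = True
    while progress:
        progress = False
        for entry in pending:
            indices, payload = entry
            # Remove every already-recovered fragment from this droplet.
            for i in [i for i in indices if recovered[i] is not None]:
                payload = xor_bytes(payload, recovered[i])
                indices.discard(i)
            entry[1] = payload
            # A degree-1 droplet reveals its remaining fragment directly.
            if len(indices) == 1:
                recovered[indices.pop()] = payload
                progress = True
    return recovered


fragments = [b"\x01\x02", b"\x10\x20", b"\xaa\xbb"]
droplets = [
    ({1, 2}, xor_bytes(fragments[1], fragments[2])),
    ({0, 1}, xor_bytes(fragments[0], fragments[1])),
    ({0}, fragments[0]),
]
decoded = lt_peel(droplets, n_fragments=3)
```

Each droplet rescued at the RS layer adds one more equation to `pending`, which is why admitting more droplets raises the chance that peeling runs to completion.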
Yin–Yang codec. We applied SeqFormer to each strand pair separately. The inner Yin–Yang mapping itself is a deterministic transcoding, so it involves no probabilistic step, but we again used an outer RS(255,223) across the data blocks. As in the RS case above, we marked high-likelihood error positions as erasures before RS decoding. For every scheme, the decoded file was then verified using a checksum or hash; Polus implements an automatic CRC32 verification of the reconstructed data.
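A CRC32 check of the reconstructed data can be done with the Python standard library, as in the minimal sketch below. How Polus stores and transports the reference checksum is not specified here, so the function names and the calling convention are assumptions.

```python
import zlib


def crc32_of(data: bytes) -> int:
    """CRC32 of a byte string as an unsigned 32-bit integer."""
    return zlib.crc32(data) & 0xFFFFFFFF


def verify_reconstruction(decoded: bytes, expected_crc: int) -> bool:
    """True if the decoded bytes match the checksum taken before encoding."""
    return crc32_of(decoded) == expected_crc


original = b"example payload"
reference = crc32_of(original)           # recorded at encoding time
ok = verify_reconstruction(original, reference)
corrupted = verify_reconstruction(b"exbmple payload", reference)
```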
Evaluation Module
Polus evaluates DNA data storage performance across multiple dimensions to capture both information-theoretic and biochemical aspects:
- Data File Recovery:
This category includes:
- Bit Error Rate (BER) – the fraction of bits that are incorrect in the final reconstructed file (after decoding) relative to the original input.
- Strand (or Block) Recovery Rate – the percentage of oligo strands (or codeword blocks) that are decoded without any errors.
A 0% BER and a 100% strand recovery rate mean the file was perfectly reconstructed. These metrics directly measure decoder success.
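Both recovery metrics are simple to compute once the original and decoded data are aligned. The sketch below assumes byte strings for the file and position-aligned lists for the strands, and counts missing or extra bytes as fully wrong; these conventions and the function names are illustrative rather than the Polus definitions.

```python
def bit_error_rate(original: bytes, decoded: bytes) -> float:
    """Fraction of original bits that are wrong in the decoded file."""
    n_bits = 8 * len(original)
    if n_bits == 0:
        return 0.0
    wrong = sum(bin(a ^ b).count("1") for a, b in zip(original, decoded))
    # Assumption: every missing or extra byte counts as 8 wrong bits.
    wrong += 8 * abs(len(original) - len(decoded))
    return min(1.0, wrong / n_bits)


def strand_recovery_rate(original_strands, decoded_strands) -> float:
    """Fraction of strands (or blocks) decoded without any error."""
    if not original_strands:
        return 0.0
    exact = sum(o == d for o, d in zip(original_strands, decoded_strands))
    return exact / len(original_strands)


ber = bit_error_rate(b"\x00\xff", b"\x01\xff")
rate = strand_recovery_rate(["ACGT", "TTTT"], ["ACGT", "TTTA"])
```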
- Logical Density:
The number of information bits per nucleotide synthesized. Let $B_{\text{info}}$ be the total number of information bits in the input file(s), and $N_{\text{syn}}$ the total number of synthesized nucleotides (summed over all oligos, including addressing and ECC). The logical density is given by:
$D_{\text{logical}} = B_{\text{info}} / N_{\text{syn}}$ (bits/nt)
- Physical Density:
The amount of data stored per unit mass of DNA, typically reported in bytes per gram. The expressions below are in bits per gram; divide by 8 to convert.
Under the ideal dry-DNA assumption, the number of nucleotides per gram is:
$N_{\text{nt/g}} = N_A / M_{\text{nt}}$
where $N_A$ is the Avogadro constant and $M_{\text{nt}}$ is the average molar mass per nucleotide of ssDNA (≈330 g/mol). Thus, the ideal physical density is:
$D_{\text{phys,ideal}} = (B_{\text{info}} / N_{\text{syn}}) \times (N_A / M_{\text{nt}})$
In practice, reliable readout requires multiple copies of each oligo. We denote by $C_{\text{min}}$ the minimal average coverage that yields error-free recovery (from our in silico experiments). Under assumptions similar to Church et al. (2012) (e.g., ~100 molecules/oligo, no synthesis or long-term loss, effective yield ≈1), the actual density reduces to:
$D_{\text{phys,actual}} = D_{\text{phys,ideal}} / C_{\text{min}}$
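The logical and physical density definitions above translate directly into code. The following sketch uses the 330 g/mol figure from the text; the function names and the example inputs (a 38-byte droplet at 2 bits/nt, a coverage of 10) are illustrative assumptions.

```python
AVOGADRO = 6.02214076e23  # mol^-1


def logical_density(info_bits, synthesized_nt):
    """Information bits per synthesized nucleotide."""
    return info_bits / synthesized_nt


def physical_density_ideal(d_logical, molar_mass_nt=330.0):
    """Bits per gram with a single copy of every oligo."""
    return d_logical * AVOGADRO / molar_mass_nt


def physical_density_actual(d_ideal, c_min):
    """Bits per gram once c_min copies per oligo are required."""
    return d_ideal / c_min


d_log = logical_density(info_bits=256, synthesized_nt=152)
d_ideal = physical_density_ideal(d_log)
d_actual = physical_density_actual(d_ideal, c_min=10)
print(f"{d_log:.3f} bits/nt, {d_ideal / 8:.3e} bytes/g ideal, "
      f"{d_actual / 8:.3e} bytes/g at C_min=10")
```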
- Sequencing Depth Requirement:
The minimum average sequencing coverage per oligo required to consistently recover the data without errors. We determine this by running simulations with decreasing read depths until decoding fails.
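The depth search described above can be sketched as a walk from high to low coverage that stops at the first failure. The `decodes_ok(coverage, trial)` callback stands in for a full simulate-and-decode run and is hypothetical, as is the number of trials per depth.

```python
def min_coverage(decodes_ok, coverages, trials=5):
    """Lowest coverage at which every trial decodes, scanning downward.

    Returns None if even the highest coverage fails.
    """
    best = None
    for cov in sorted(coverages, reverse=True):
        if all(decodes_ok(cov, trial) for trial in range(trials)):
            best = cov
        else:
            break  # first failing depth ends the search
    return best


# Toy stand-in for the real pipeline: decoding works from 7x upward.
def toy_decoder(coverage, trial):
    return coverage >= 7


c_min = min_coverage(toy_decoder, [20, 15, 10, 8, 6, 4])
```

Requiring all trials to succeed at a given depth is one way to read "consistently recover"; a success-rate cutoff would be a looser alternative.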
- Encoding Speed:
The throughput of the encoding process (data processed per second). We measured wall-clock times of our encoder implementations on a XXX CPU to compare their efficiency.
- Decoding Speed:
The throughput of the decoding process (data processed per second). We measured wall-clock times for decoding on the same hardware. For Polus, we separately recorded the time for clustering, SeqFormer inference, and error-correction decoding.
- Cost per MB:
An estimate of the monetary cost to store and retrieve 1 MB of data using each scheme, factoring in:
- DNA synthesis cost per base (≈$0.10)
- DNA sequencing cost per read/coverage (≈$1000 per ~1B Illumina reads, ≈$30/Gb for Nanopore MinION)
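One way to combine the two cost factors is sketched below. The formula (synthesis charged per nucleotide, sequencing charged per read at the required coverage) is an assumption for illustration, since the text lists the inputs but not how they are combined; the defaults reuse the approximate Illumina prices quoted above.

```python
def cost_per_mb(file_bytes, synthesized_nt, n_oligos, coverage,
                synth_cost_per_nt=0.10, seq_cost_per_read=1000 / 1e9):
    """Estimated USD per MB stored and read back once (assumed model)."""
    synthesis = synthesized_nt * synth_cost_per_nt
    sequencing = n_oligos * coverage * seq_cost_per_read
    return (synthesis + sequencing) / (file_bytes / 1e6)


# Example: 1 MB file, 5.1 million synthesized nt in 33,000 oligos, 10x depth.
estimate = cost_per_mb(1_000_000, 5_100_000, 33_000, 10)
```

At these prices the synthesis term dominates by several orders of magnitude, so logical density drives the estimate far more than sequencing depth does.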
- GC Content:
The average (and distribution of) GC fraction in the encoded oligos. All our schemes enforce roughly 45–55% GC, but this is reported to ensure none produce extreme GC outliers that may hinder PCR or sequencing.
- Max Homopolymer Length:
The longest run of identical nucleotides in any encoded oligo. This reflects adherence to biochemical constraints, since homopolymers longer than ~5 bases can increase sequencing error rates.
References
- Erlich, Y. & Zielinski, D. DNA Fountain enables a robust and efficient storage architecture. Science. 2017; 355(6328):950-954.
- Ping, Z., Chen, S., Zhou, G., Huang, X., Zhu, S., Zhang, H., Lee, H., Lan, Z., Cui, J., Chen, T., Zhang, W., Yang, H., Xu, X., Church, G., Shen, Y. Towards practical and robust DNA-based data archiving using the Yin–Yang Codec system. Nature Computational Science. 2022; 2:234-242.
- Ding, L., Wu, S., Hou, Z., Li, A., Xu, Y., Feng, H., Pan, W., & Ruan, J. Improving error-correcting capability in DNA digital storage via soft-decision decoding (Derrick). National Science Review. 2024; 11(2):nwad229.
- Press, W. H., Hawkins, J. A., Jones, S. K., Schaub, J. M., & Finkelstein, I. J. HEDGES error-correcting code for DNA storage corrects indels and allows sequence constraints. Proceedings of the National Academy of Sciences. 2020; 117(31):18489-18496.
Note: for the algorithms implemented in this tool (Fountain, Yin–Yang Codec, Derrick, HEDGES), please refer to the above references for details and statements of competing interests.