DNA Sequencing Overlap Assembly
The bioinformatics that sequenced the human genome and COVID-19 — reading the code of life
What is it?
Imagine a book torn into thousands of strips, each only a few words long. The strips overlap — the last word of one strip matches the first word of the next. Your job is to find the overlaps and reassemble the book.
DNA sequencers work the same way. They can only read short fragments (50-500 base pairs). By finding where fragments overlap, scientists assemble entire genomes. In this cipher, the message is encoded as DNA codons (3 bases per character) and fragmented into short reads.
Concrete Example
Message: GENE
ASCII: G=71, E=69, N=78, E=69 DNA codons (base-4 encoding): 71 % 4 = 3 → T, (71//4)%4 = 3 → T, (71//16)%4 = 2 → G → "TTG" 69 % 4 = 1 → A, (69//4)%4 = 1 → A, (69//16)%4 = 2 → G → "AAG" 78 % 4 = 2 → G, (78//4)%4 = 1 → A, (78//16)%4 = 2 → G → "GAG" 69 % 4 = 1 → A, (69//4)%4 = 1 → A, (69//16)%4 = 2 → G → "AAG" Full DNA: TTGAAGGAGAAG Reads (length 5, step 3): TTGAA, AAGGA, GGAGA, GAAG Assemble overlaps → TTGAAGGAGAAG → split into codons → decode
How It Works
- The answer's ASCII codes are converted to DNA using a base-4 encoding scheme
- Each character → 3 DNA bases (ATCG)
- The full DNA string is fragmented into short overlapping reads
- Reads are shuffled — you must assemble them by overlap
- The assembled string is split into 3-base codons
- Each codon is decoded: value = base0 + base1×4 + base2×16 → ASCII
Step-by-Step Solving
def overlap(a, b):
"""Find overlap length where end of a matches start of b."""
for i in range(len(b), 0, -1):
if a.endswith(b[:i]):
return i
return 0
def assemble(reads):
genome = reads[0]
used = {0}
while len(used) < len(reads):
best, best_ol, best_i = 0, 0, -1
for i, read in enumerate(reads):
if i in used: continue
ol = overlap(genome, read)
if ol > best_ol:
best, best_ol, best_i = read, ol, i
if best_i >= 0:
genome += best[best_ol:]
used.add(best_i)
return genome
# Decode: 3 bases per codon
bases = "ATCG"
codon_val = {b: i for i, b in enumerate(bases)}
def decode(dna):
chars = []
for i in range(0, len(dna), 3):
if i+3 > len(dna): break
val = codon_val[dna[i]] + codon_val[dna[i+1]]*4 + codon_val[dna[i+2]]*16
chars.append(chr(val))
return ''.join(chars)
Real-World Applications
- Human Genome Project: First complete human genome (2003) assembled from overlapping shotgun reads using this method
- COVID-19: The SARS-CoV-2 genome was sequenced and assembled within weeks of the outbreak using overlap assembly
- Ancient DNA: Neanderthal and Denisovan genomes were assembled from highly degraded, short DNA fragments
- Metagenomics: Sequencing bacterial communities in your gut by assembling all genomes simultaneously
- Cancer genomics: Assembling tumor genomes to find mutations driving cancer growth