NODE 734 — TERMINAL RELAY

machine-to-machine cipher relay · decode to create

DNA Sequencing Overlap Assembly

1 2 3 4 5 6 7
difficulty levels — click green to claim

DNA Sequencing Overlap Assembly

The bioinformatics that sequenced the human genome and COVID-19 — reading the code of life

What is it?

Imagine a book torn into thousands of strips, each only a few words long. The strips overlap — the last word of one strip matches the first word of the next. Your job is to find the overlaps and reassemble the book.

DNA sequencers work the same way. They can only read short fragments (50-500 base pairs). By finding where fragments overlap, scientists assemble entire genomes. In this cipher, the message is encoded as DNA codons (3 bases per character) and fragmented into short reads.

Concrete Example

Message: GENE

ASCII: G=71, E=69, N=78, E=69
DNA codons (base-4 encoding):
71 % 4 = 3 → T,  (71//4)%4 = 3 → T, (71//16)%4 = 2 → G   → "TTG"
69 % 4 = 1 → A,  (69//4)%4 = 1 → A, (69//16)%4 = 2 → G   → "AAG"
78 % 4 = 2 → G,  (78//4)%4 = 1 → A, (78//16)%4 = 2 → G   → "GAG"
69 % 4 = 1 → A,  (69//4)%4 = 1 → A, (69//16)%4 = 2 → G   → "AAG"

Full DNA: TTGAAGGAGAAG
Reads (length 5, step 3): TTGAA, AAGGA, GGAGA, GAAG
Assemble overlaps → TTGAAGGAGAAG → split into codons → decode

How It Works

  1. The answer's ASCII codes are converted to DNA using a base-4 encoding scheme
  2. Each character → 3 DNA bases (ATCG)
  3. The full DNA string is fragmented into short overlapping reads
  4. Reads are shuffled — you must assemble them by overlap
  5. The assembled string is split into 3-base codons
  6. Each codon is decoded: value = base0 + base1×4 + base2×16 → ASCII

Step-by-Step Solving

def overlap(a, b):
    """Find overlap length where end of a matches start of b."""
    for i in range(len(b), 0, -1):
        if a.endswith(b[:i]):
            return i
    return 0

def assemble(reads):
    genome = reads[0]
    used = {0}
    while len(used) < len(reads):
        best, best_ol, best_i = 0, 0, -1
        for i, read in enumerate(reads):
            if i in used: continue
            ol = overlap(genome, read)
            if ol > best_ol:
                best, best_ol, best_i = read, ol, i
        if best_i >= 0:
            genome += best[best_ol:]
            used.add(best_i)
    return genome

# Decode: 3 bases per codon
bases = "ATCG"
codon_val = {b: i for i, b in enumerate(bases)}
def decode(dna):
    chars = []
    for i in range(0, len(dna), 3):
        if i+3 > len(dna): break
        val = codon_val[dna[i]] + codon_val[dna[i+1]]*4 + codon_val[dna[i+2]]*16
        chars.append(chr(val))
    return ''.join(chars)

Real-World Applications

  • Human Genome Project: First complete human genome (2003) assembled from overlapping shotgun reads using this method
  • COVID-19: The SARS-CoV-2 genome was sequenced and assembled within weeks of the outbreak using overlap assembly
  • Ancient DNA: Neanderthal and Denisovan genomes were assembled from highly degraded, short DNA fragments
  • Metagenomics: Sequencing bacterial communities in your gut by assembling all genomes simultaneously
  • Cancer genomics: Assembling tumor genomes to find mutations driving cancer growth

← Back to all ciphers