BRANCH — a genome assembler for rare somatic copy-number changes

// Why CNVs

Two copies of every gene? Mostly. Not always.

What you learned in school — one copy from each parent — is the average case, not the rule. Lots of regions in the human genome don't sit at exactly two copies, and a surprising amount of variation isn't even fixed at birth: parts of the genome change copy count over a lifetime, in some cells but not others.

AMY1 · between 2 and 17 copies

The gene that makes the amylase in your saliva ranges from 2 to 17 copies between people. Populations whose ancestors ate a lot of starch carry on average about six copies; populations with traditionally low-starch diets carry about three. One of the first clear cases of recent positive selection acting on a copy-number variant.

CYP2D6 · drug metabolism

One of the main enzymes that breaks down medications — including codeine, tamoxifen, and many antidepressants. Copy-number variation across the population means the same dose can be subtherapeutic in one patient and toxic in another. Entire prescribing guidelines exist around CYP2D6 status.

Loss of chromosome Y · >40% of men over 70

More than 40% of men over 70 have lost the Y chromosome from a substantial fraction of their blood cells — the rest of the body still carries it. The phenomenon is associated with increased risk of cancer, cardiovascular disease and Alzheimer's. By definition, it lives in only some of the cells in a sample.

Clonal hematopoiesis · 10-20% of older adults

10-20% of people over 70 carry blood cells from an expanded sub-clone — descendants of a single stem cell that picked up a mutation and outgrew its neighbours. Some of those clones carry copy-number changes the rest of the body never had. Standard germline pipelines see one averaged answer; the sub-clone is the part that matters.

So — are you sure every region you care about is stable enough to assume one answer per haplotype?

Standard germline pipelines say yes. By design. They were built to find the consensus, not the disagreement. BRANCH is built for the disagreement.

// What it does

The idea

DNA sequencing returns millions of overlapping pieces. Stitching them back into chromosomes is the hard part — the genome repeats itself in many places, and the copy-number changes that matter most clinically often live in only a small fraction of the cells.

What standard pipelines still collapse

A modern assembler can already separate the two parental haplotypes. But once a region carries more than two versions — three near-identical paralog copies, an extra copy gained in some cells, a tandem expansion only a fraction of the sample shows — most pipelines collapse it down to one consensus per haplotype, and the sub-haplotype variation gets averaged in.

BRANCH keeps each version

Where reads disagree, BRANCH keeps every version the reads actually support as a parallel branch in the assembly graph: not just two haplotypes, but as many versions as the evidence shows. Each branch carries its supporting read count and an inferred copy count, so a sub-clonal extra copy shows up as its own branch with its own evidence.

Copies counted per branch

Each branch in the graph carries its own copy-count estimate. So a gene present at different copy counts on the two chromosomes — or expanded in only a fraction of cells — appears as separate branches with separate counts, instead of being merged into one number for the whole region.

A concrete example

A blood sample contains a small subset of cells in which one immune-system gene has gained an extra copy compared to the rest of the sample. A standard germline pipeline returns one answer at that locus per haplotype — the sub-population's extra copy blends into the majority and the difference is gone. In BRANCH, the same locus has more than one branch in the assembly graph: a main branch from most of the reads, with the usual copy count, and a second branch from a minority of reads, with one extra copy. The rare population is now visible — and the read count on each branch tells you how rare.

// How it works

Five stages from reads to a branched graph

Sequencing reads from one sample go in. An assembly graph (GFA) plus per-branch consensus (FASTA), alignments (PAF) and branch intervals (BED) come out — every branch tagged with read support and an inferred copy count.

1 · read

Reads are loaded from the sequencing files (BAM or FASTQ) and indexed for overlap detection.

2 · overlap

Reads that share an identifying stretch (a minimizer) are linked. The result is a graph in which every read is a path; branches appear wherever reads disagree.
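As a rough illustration of linking reads by shared minimizers, here is a minimal sketch. It assumes exact k-mers on one strand and plain alphabetical ordering; production tools typically hash k-mers and handle both strands, and nothing here is taken from BRANCH's own code. The function names, the values of k and w, and the toy reads are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def minimizers(seq, k=5, w=4):
    """Smallest k-mer (alphabetical order) in each window of w consecutive k-mers."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    if not kmers:
        return set()
    if len(kmers) < w:
        return {min(kmers)}
    return {min(kmers[i:i + w]) for i in range(len(kmers) - w + 1)}


def link_reads(reads, k=5, w=4):
    """Return pairs of read names that share at least one minimizer."""
    index = defaultdict(set)  # minimizer -> names of reads containing it
    for name, seq in reads.items():
        for m in minimizers(seq, k, w):
            index[m].add(name)
    links = set()
    for names in index.values():
        links.update(combinations(sorted(names), 2))
    return links


# Toy reads (made up): r2 overlaps the end of r1, r3 is unrelated.
reads = {
    "r1": "ACGTTGCATGCCA",
    "r2": "TTGCATGCCAGGT",
    "r3": "GGGGGGGGGGGGG",
}
links = link_reads(reads)
print(links)
```

On these toy reads the sketch links r1 with r2 and leaves r3 unlinked. Indexing by minimizer avoids comparing every read against every other read.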

3 · simplify

Long stretches where every read agrees collapse into one straight piece. Where reads disagree, including sub-haplotype disagreements between just a few reads, every version stays as a parallel branch.
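One common way to do this kind of simplification is to merge every unbranched run of nodes into a single path and stop at each split or join. The sketch below assumes a small directed graph without cycles, stored as a plain dictionary; the function name and the example graph are hypothetical and not BRANCH's internal representation.

```python
def compact_chains(edges):
    """Merge unbranched runs of nodes into paths; stop at every split or join.

    edges maps node -> list of successor nodes. Assumes no cycles.
    """
    nodes = set(edges) | {t for targets in edges.values() for t in targets}
    succ = {n: list(edges.get(n, [])) for n in nodes}
    pred = {n: [] for n in nodes}
    for n, targets in succ.items():
        for t in targets:
            pred[t].append(n)

    def extends(n):
        # True when n just continues its single predecessor's only outgoing edge.
        return len(pred[n]) == 1 and len(succ[pred[n][0]]) == 1

    paths = []
    for n in sorted(nodes):
        if extends(n):
            continue  # n sits inside a path that starts elsewhere
        path = [n]
        while len(succ[path[-1]]) == 1 and len(pred[succ[path[-1]][0]]) == 1:
            path.append(succ[path[-1]][0])
        paths.append(path)
    return paths


# Toy graph: A-B-C agree, then two versions D1 and D2, then E-F agree again.
graph = {
    "A": ["B"], "B": ["C"], "C": ["D1", "D2"],
    "D1": ["E"], "D2": ["E"], "E": ["F"],
}
paths = compact_chains(graph)
print(paths)
```

The agreeing runs A-B-C and E-F each become one piece, while D1 and D2 remain separate parallel branches between them.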

4 · clean

Redundant edges and reads contained inside other reads are removed; the graph keeps only the structure needed to describe everything observed.
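Two textbook clean-up steps match this description: dropping reads that lie entirely inside another read, and removing an edge a→c when a→b and b→c already imply it. The sketch below is an assumption about what such a step could look like, not BRANCH's implementation. Containment is judged by layout coordinates only, whereas a real assembler would also require the sequences to agree. All names and data are hypothetical.

```python
def drop_contained(intervals):
    """Keep reads whose (start, end) span is not strictly inside another read's span."""
    kept = {}
    for name, (start, end) in intervals.items():
        contained = any(
            other != name
            and o_start <= start
            and end <= o_end
            and (o_start, o_end) != (start, end)  # identical spans are both kept
            for other, (o_start, o_end) in intervals.items()
        )
        if not contained:
            kept[name] = (start, end)
    return kept


def transitive_reduction(edges):
    """Remove edge a->c whenever a->b and b->c both exist. Assumes no self-loops."""
    reduced = {a: set(targets) for a, targets in edges.items()}
    for a, targets in edges.items():
        for b in targets:
            for c in edges.get(b, ()):
                reduced[a].discard(c)
    return reduced


# Toy layout: r2 lies entirely inside r1.
spans = {"r1": (0, 100), "r2": (20, 60), "r3": (80, 180)}
kept = drop_contained(spans)

# Toy overlap graph: a->c is implied by a->b->c.
overlaps = {"a": {"b", "c"}, "b": {"c"}, "c": set()}
reduced = transitive_reduction(overlaps)
print(kept, reduced)
```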

5 · annotate

Each branch gets a consensus sequence, its position on the standard human reference, the count of reads that support it, and an inferred copy count from depth.

// Design

Lossless graph. Evidence on every branch.

The assembly graph is BRANCH's output, not an intermediate step. Branches that other pipelines collapse into a single consensus stay in the output, each one tagged with the reads that support it, an inferred copy count, and where it sits on the standard reference.

Reads are routed, not filtered

Reads are routed through the graph instead of filtered out as noise. Every read backing a branch can be traced to the variation it supports — so the read count on each branch is a direct tally, not a downstream model estimate.

Copy count per branch

Each branch carries its own copy-count estimate, derived from read depth normalised against parts of the reference known to exist in a single copy. A region that appears at different counts on the two chromosomes, or that has gained a copy in a sub-population of cells, ends up as separate branches with separate counts.
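The normalisation can be illustrated with a few lines of arithmetic. This is a minimal sketch under stated assumptions: the baseline is taken as the median depth over a handful of single-copy regions, and the result is expressed relative to that baseline, because the text above does not say whether the baseline stands for one copy per haplotype or two per diploid cell. The function name and all depth values are made up.

```python
from statistics import median


def relative_copy_count(branch_depth, single_copy_depths):
    """Branch depth divided by the median depth of single-copy regions."""
    baseline = median(single_copy_depths)
    if baseline <= 0:
        raise ValueError("baseline depth must be positive")
    return branch_depth / baseline


# Hypothetical mean depths over five single-copy regions (median is 30.0).
baseline_depths = [29.0, 31.0, 30.0, 30.5, 29.5]

main_branch = relative_copy_count(30.0, baseline_depths)   # same depth as baseline
extra_branch = relative_copy_count(45.0, baseline_depths)  # 1.5x the baseline depth
print(main_branch, extra_branch)
```

Using the median rather than the mean keeps one unusually deep or shallow baseline region from shifting every estimate.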

Read support on every branch point

Every place where the graph splits is annotated with how many reads went each way. That turns a low-frequency change from something arguable into a number on the page — and a list of supporting reads you can pull up.
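Turning those tallies into a frequency is simple arithmetic. A minimal sketch, assuming the per-branch read counts at one branch point are already in hand (the counts below are invented for illustration, not BRANCH output):

```python
def branch_fractions(read_counts):
    """Return each branch's share of the reads at one branch point."""
    total = sum(read_counts.values())
    if total == 0:
        raise ValueError("no reads at this branch point")
    return {name: count / total for name, count in read_counts.items()}


# Hypothetical tally: 188 reads follow the main branch, 12 the minority branch.
fractions = branch_fractions({"main": 188, "extra_copy": 12})
print(fractions)
```

Here the minority branch holds 6% of the reads. The read share is a first approximation of how rare the underlying cell population is, not an exact cell fraction.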

Calls with traceable evidence

Each named call — somatic CNV, alternate haplotype, paralog copy — is recorded together with the rule that produced it and the evidence behind it: supporting reads, in-silico PCR amplicons across the call, and k-mer counts on the read sequence.
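To show what an in-silico PCR check across a call might involve, here is a minimal sketch. It assumes exact primer matches on the forward strand of a single sequence; real primer search tolerates mismatches and checks both strands, and this is not BRANCH's implementation. The function names, primers, and template are all hypothetical.

```python
def reverse_complement(seq):
    """Reverse-complement an uppercase DNA string of A, C, G and T."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def in_silico_pcr(template, forward, reverse, max_len=5000):
    """Return the amplicon between two exactly matching primers, or None."""
    start = template.find(forward)
    if start == -1:
        return None
    # The reverse primer binds the opposite strand, so search for its reverse complement.
    reverse_site = reverse_complement(reverse)
    end = template.find(reverse_site, start + len(forward))
    if end == -1:
        return None
    amplicon = template[start:end + len(reverse_site)]
    return amplicon if len(amplicon) <= max_len else None


# Toy template and primers (made up for illustration).
template = "TTACGGATCCAAATTTGGGCCCTAGCAT"
amplicon = in_silico_pcr(template, forward="ACGGATCC", reverse="TAGGGCCC")
print(amplicon, len(amplicon))
```

Comparing the amplicon length on a branch with the length expected from the reference is one way such evidence could support or contradict a call.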

Anchored to a real reference

Every branch is placed against both standard human references, GRCh38 and CHM13. Projection onto the human pangenome is handled by the branch project subcommand.

SLURM-friendly, filesystem-agnostic

A SLURM driver scaffold lives in the repo (workflow/). The pipeline is filesystem-agnostic — point it at your reads and an output directory, on a laptop for one sample or on a research cluster for many.

// Output

What you actually get back

BRANCH is one program with three subcommands. They all work on the same underlying assembly graph.

$ branch assemble

Reads in, graph out

Takes sequencing files for one sample (BAM or FASTQ) and writes an assembly graph (GFA) plus per-branch consensus (FASTA), alignments (PAF), and branch intervals (BED). Each branch carries its read support and inferred copy count.
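Because the graph is plain GFA, per-branch annotations can be read with a few lines of code. The sketch below parses GFA 1 segment lines and their optional tags. RC (read count) is a tag defined by the GFA 1 specification; the lowercase cn tag for copy count is purely an assumption for illustration, since the tags BRANCH actually writes are not listed here. The example graph is made up.

```python
def parse_segments(gfa_text):
    """Collect name, length and optional tags from GFA 1 segment (S) lines."""
    segments = {}
    for line in gfa_text.splitlines():
        if not line.startswith("S\t"):
            continue  # skip header, link and other record types
        fields = line.split("\t")
        name, sequence = fields[1], fields[2]
        tags = {}
        for field in fields[3:]:
            tag, value_type, value = field.split(":", 2)
            if value_type == "i":
                value = int(value)
            elif value_type == "f":
                value = float(value)
            tags[tag] = value
        # "*" means the sequence is stored elsewhere; fall back to the LN tag.
        length = tags.get("LN") if sequence == "*" else len(sequence)
        segments[name] = {"length": length, "tags": tags}
    return segments


# Hypothetical graph: one flank leading into a main branch and a minority branch.
example_gfa = "\n".join([
    "H\tVN:Z:1.0",
    "S\tflank\tACGTACGT\tRC:i:200",
    "S\tmain\tGGCATT\tRC:i:188\tcn:f:2.0",
    "S\talt\tGGCATTGGCATT\tRC:i:12\tcn:f:3.0",
    "L\tflank\t+\tmain\t+\t0M",
    "L\tflank\t+\talt\t+\t0M",
])
segments = parse_segments(example_gfa)
for name, info in segments.items():
    print(name, info["length"], info["tags"])
```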

$ branch analyze

Graph in, copy-number track out

Takes the graph and the alignments from the previous step and produces a copy-number track per branch, normalised against single-copy reference regions so two samples can be compared directly. Depth comes from mosdepth; paralog awareness is built in.

$ branch project

Graph in, comparison out

Maps each branch onto the human reference (CHM13 + GRCh38) and onto the human pangenome (HPRC v1.1), reports the difference between your sample and the closest known sequence, and flags rearrangements that match nothing known as novel.

// Open source

For studies where the rare copies are the answer.

BRANCH is in active development. It is built for the kind of studies where the rare cell populations are the answer — mosaic copy-number changes carried by only part of a tissue, immune-locus rearrangements that vary between cell populations, somatic events acquired later in life. Single-base changes are picked up as branches as well, but the design target is somatic copy-number variation.
