Why pangenomics?
The Human Genome Project's assembly of the first human genome took 13 years and cost $2.7 billion. Today, consumers can have their whole genomes sequenced for as little as $300. Yet most biomedical researchers continue to use a single linear reference genome to assemble larger genome sequences, detect genetic variation, and determine gene function.
Who defines this reference genome?
How is it possible to capture the entire diversity of a species with an arbitrary single linear sequence?
In contrast, a pangenome is a collection of genomes organized into a graph data structure that can be used for holistic genetic analysis. It is hard to overstate the impact of moving from a sequence-based to a graph-based view of genetics. A traditional reference genome based on a single sequence might be missing up to 10% of the genetic sequence found in under-represented populations; these missing sequences are the unexplored 'dark matter' of the genetic universe. Today, all human DNA variants are identified by comparison with this standard reference genome. Although the 'standard' reference was a massive step forward at the time, it is now known to be biased and to misrepresent whole populations: for example, 10% of African DNA sequence does not align to the current reference genome.
For other populations the gap may be smaller, but the reference approach is still biased and therefore affects genetic studies of under-represented populations and even of supposedly well-represented populations.
This unexplored genomic 'dark matter' might include beneficial or harmful traits that traditional techniques are not equipped to study. For example, recent work suggests that a pangenomic approach could have more efficiently discovered a critical genetic variant in the Icelandic population that reduces the risk of heart attacks, as well as a different critical variant that increases the risk of dementia.
Pangenomics will transform the software stack in computational biology, requiring new graph data structures and new algorithms for graph construction, alignment, annotation, and visualization.
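To make the idea of a graph data structure concrete, here is a minimal sketch in Python, not the representation used by any particular pangenome tool. The class name `SequenceGraph`, its methods, the sample names, and the DNA segments are all hypothetical and chosen only for illustration. Nodes hold short DNA segments, and each genome is stored as a path (an ordered walk) through those nodes, so shared sequence is stored once and variation appears as alternative branches.

```python
from dataclasses import dataclass, field


@dataclass
class SequenceGraph:
    """Toy pangenome graph (hypothetical): nodes hold DNA segments, paths spell genomes."""

    nodes: dict[int, str] = field(default_factory=dict)
    edges: set[tuple[int, int]] = field(default_factory=set)
    paths: dict[str, list[int]] = field(default_factory=dict)

    def add_node(self, node_id: int, sequence: str) -> None:
        self.nodes[node_id] = sequence

    def add_path(self, name: str, node_ids: list[int]) -> None:
        # Store the walk and the edges between consecutive nodes on it.
        self.paths[name] = list(node_ids)
        self.edges.update(zip(node_ids, node_ids[1:]))

    def spell(self, name: str) -> str:
        # Concatenate node sequences along the path to recover the genome.
        return "".join(self.nodes[n] for n in self.paths[name])

    def bases_absent_from(self, reference: str) -> int:
        # Count bases in nodes that the chosen reference path never visits.
        ref_nodes = set(self.paths[reference])
        return sum(len(seq) for n, seq in self.nodes.items() if n not in ref_nodes)


# Made-up segments: node 3 is a single-base alternative to node 2,
# and node 5 is an insertion that the reference path skips.
graph = SequenceGraph()
for node_id, seq in [(1, "ACGT"), (2, "A"), (3, "G"), (4, "TTC"), (5, "GGA"), (6, "CA")]:
    graph.add_node(node_id, seq)

graph.add_path("reference", [1, 2, 4, 6])
graph.add_path("sample_a", [1, 3, 4, 6])      # single-base substitution
graph.add_path("sample_b", [1, 2, 4, 5, 6])   # insertion

for name in graph.paths:
    print(name, graph.spell(name))
print("bases absent from reference:", graph.bases_absent_from("reference"))
```

In this toy example the four bases that the reference path never visits stand in for the sequence a single linear reference would miss. Real pangenome graphs also have to handle strand orientation, cycles, and much larger scale, which this sketch ignores.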
We expect pangenome tools to replace the reference-based approaches used in most genomic workflows today. As an example of this work and of the potential impact of computational pangenomics, consider the pangenome graph for chromosome 20 very recently constructed for the Human Pangenome Reference Consortium (HPRC). Computational pangenomics revealed a hairball, a densely tangled region of the graph, that indicates an unexpected amount of variation in the previously inaccessible centromeric region.
Further reading
- Eizenga, J. M., Novak, A. M., Sibbesen, J. A., Heumos, S., Ghaffaari, A., Hickey, G., Chang, X., Seaman, J. D., Rounthwaite, R., Ebler, J., Rautiainen, M., Garg, S., Paten, B., Marschall, T., Sirén, J., and Garrison, E. Pangenome Graphs. Annual Review of Genomics and Human Genetics, 2020.
- Sherman, R. M. and Salzberg, S. L. Pan-Genomics in the Human Genome Era. Nature Reviews Genetics, 2020.