Pangenomics Bio Hacking 2021

December 9-10, 2021, Online, Virtually in Milano


About

We are excited to organize the first worldwide conference, on December 9-10, for researchers who focus on free and open source software development for pangenomics.

The meeting will be fully online. Morning sessions (Central European time zone) are dedicated to hacking and virtually take place at the University of Milano - Bicocca, Milano, Italy. Talks take place either in the early morning (Asia time zones) or in the evening (USA time zones).

Participation is free, but registration is required.

Videos of all talks are online on YouTube.

Location

Virtually in Milano, Italy

Videos

Day 1: YouTube Stream

Day 2: YouTube Stream

Chats

general

question to speakers

Speakers

  • 9h30 Advanced high-resolution approaches: human health and biodiversity genomics

    Shilpa Garg home page

    University of Copenhagen

  • 15h00 Representing 661,405 bacterial genomes using minimizer-space de Bruijn graphs

    Rayan Chikhi homepage

    Institut Pasteur and CNRS

  • 15h30 Establishing Bovine Pangenome Graphs

    Danang Crysnanto website

    ETH Zurich

  • 16h00 Giraffe: A Pangenomic Short Read Aligner

    Jouni Siren home page

    UC Santa Cruz

  • 16h30 gSV - a reference free SV caller

    Christian Kubica GitHub

    Max Planck Institute for Developmental Biology, Tuebingen, Germany

  • 9h30 Fast and memory-efficient partial order alignment with abPOA

    Yan Gao GitHub

    Children's Hospital of Philadelphia and Sun Yat-sen University

  • 16h00 PanGenie - Pangenome-based inference

    Jana Ebler website

    Heinrich Heine University Düsseldorf

  • 16h30 A pangenome for the expanded BXD family of mice

    Flavia Villani website

    University of Tennessee

  • 17h00 MONI: Design Challenges for a Pangenomic Tool

    Massimiliano Rossi home page

    University of Florida

Schedule

All times are Central European Time (CET).

Time Slot Description
Dec 09 - 9h00 Check-in
9h30 Shilpa Garg
Advanced high-resolution approaches: human health and biodiversity genomics
Chromosome-scale haplotypes are important for studying genetic variation associated with diseases and with the evolution of biodiversity. Advancements in third-generation sequencing have opened enormous opportunities to reconstruct genomes at single-base resolution. In this talk, I will present sequencing methods and graph-based approaches to combine the data types in an efficient manner. Further, I will provide examples relevant in clinical settings and demonstrate these approaches in comparative genomics.
10h00 Hackathon Hack Goal
12h00 Lunch
13h00 Hackathon Hack Goal
15h00 Rayan Chikhi
Representing 661,405 bacterial genomes using minimizer-space de Bruijn graphs
DNA sequencing continues to progress toward longer and more accurate reads. Yet, primary analyses, such as genome assembly and pangenome graph construction, remain challenging and energy-inefficient. We recently introduced the concept of minimizer-space sequencing analysis, expanding the alphabet of DNA sequences to atomic tokens made of fixed-length words. This leads to orders-of-magnitude improvements in speed and memory usage for human genome assembly and metagenome assembly and enables for the first time a representation of a pangenome made of 661,405 bacterial genomes.
15h30 Danang Crysnanto
Establishing Bovine Pangenome Graphs
The reference genome is generated from a single or a few individuals and is thus a poor representation of the full species diversity. This pitfall has led to the development of the pangenome: the use of multiple genomes for genomic analysis. In this study, we integrated six cattle reference-quality genome assemblies into a pangenome graph, and we found 70 Mb of sequence that is not included in the existing cattle reference genome. We further demonstrated that these non-reference sequences contain functionally active bases and thousands of polymorphic sites that remain undetected with a single linear genome. Our findings call for a more representative reference genome that captures the entire species diversity.
16h00 Jouni Siren
Giraffe: A Pangenomic Short Read Aligner
Pangenomic aligners try to avoid reference bias in read mapping by including common variants in the reference. They achieve a higher mapping accuracy than traditional linear aligners, at the expense of being an order of magnitude slower. We present Giraffe, a new short read aligner that avoids this trade-off with a number of algorithmic and data model improvements. By combining the speed of linear aligners with the accuracy of pangenomic aligners, Giraffe is a viable choice for large-scale sequencing projects.
16h30 Christian Kubica
gSV - a reference free SV caller
While reference-guided variant detection has helped us to explore a large fraction of, mostly non-complex, variation, its properties derived from pairwise comparisons have hindered true pan-genomic analysis. Until now, variant calls have always been reference biased and dependent on sequence being available in the singular reference genome. The coordinate system used has been reference based and thus the sequence context in the query sequence has been lost. Furthermore, nested variation has been hard to access from such variant calls. By using the information stored in genome graphs derived from whole-genome alignments, we are now able to explore variation in a true pan-genomic environment. Here we present gSV, a first solution to call complex and nested variation as it is represented in a genome graph. gSV explores the graph-inherent bubble structures from a fully pan-genomic point of view, thus removing the need for a reference coordinate system. This enables us to access nested variation and assign hierarchical parent-child relationships to bubbles while preserving the sequence context of each variant in every genome.
17h30 End of Day 1
Dec 10 - 9h00 Beginning of Day 2
9h30 Yan Gao
Fast and memory-efficient partial order alignment with abPOA
We present abPOA (adaptive banded Partial Order Alignment), a Single Instruction Multiple Data (SIMD)-based C library for fast and memory-efficient partial order alignment. abPOA uses a minimizer-based seeding and partition approach to split the sequence and graph into small windows and separately performs partial order alignment within each window.
10h00 Hackathon Hack Goal
12h00 Lunch
13h00 Hackathon Hack Goal
16h00 Jana Ebler
PanGenie - Pangenome-based inference
Typical analysis workflows map reads to a reference genome in order to genotype genetic variants. Generating such alignments introduces reference biases and comes with a substantial computational burden. In contrast, recent k-mer-based genotypers are fast, but struggle in repetitive or duplicated genomic regions. In this talk, I propose a novel algorithm, PanGenie, that leverages a haplotype-resolved pangenome reference in conjunction with k-mer counts from short-read sequencing data to genotype a wide spectrum of genetic variation, a process we refer to as genome inference. Compared to mapping-based approaches, PanGenie is more than 4x faster at 30x coverage and reaches significantly better genotype concordances for almost all variant types and coverages tested. Improvements are especially pronounced for large insertions (>=50bp) and variants in repetitive regions, enabling the inclusion of these classes of variants in genome-wide association studies. PanGenie efficiently leverages the increasing amount of haplotype-resolved assemblies to unravel the functional impact of previously inaccessible variants while being scalable to thousands of genotyped samples.
16h30 Flavia Villani
A pangenome for the expanded BXD family of mice
The members of the BXD family have been inbred for 20-200 generations. They are of great value for mapping complex traits and for phenome-wide association analysis. Current genomic studies on BXD assume a single linear reference genome, making it difficult to observe sequences diverging from the reference and therefore limiting the accuracy and completeness of analyses. We sequenced all extant members of all BXD families using linked-read libraries and built the pangenome graph to study genetic variation. We determined that linked reads are not ideal for pangenome building; nevertheless, the pangenome enhanced the calling of complex variants not seen by traditional genomics methods and provided calls with good precision and sensitivity. As a case study, we followed up a strain-specific 2 kb insertion that was inherited by half of the recombinant mice and correlated the genotype at this locus with clinically relevant phenotypes present in the GeneNetwork database.
17h00 Massimiliano Rossi
MONI: Design Challenges for a Pangenomic Tool
MONI is a pangenomic index for finding maximal exact matches (MEMs). It is built on top of the r-index but also includes auxiliary data structures to allow finding approximate matches on a pangenomic scale. It includes many open-source tools and data structures such as BigBWT, BigRePair, and SDSL. I will talk about MONI, which is written in C++ and freely available on GitHub (https://github.com/maxrossi91/moni).
17h30 Discussion and Greetings

Sponsors

website source code

Organizers

Why pangenomics?

The Human Genome Project's assembly of the first genome took 13 years and cost $2.7 billion. Today, consumers can sequence their whole genomes for as little as $300. Yet most biomedical researchers continue to use a single linear reference genome to assemble larger genome sequences, detect gene variation, and determine gene function.

Who defines this reference genome?

How is it possible to capture the entire diversity of a species with an arbitrary single linear sequence?

In contrast, a pangenome is a collection of genomes organized into a graph data structure to be used for holistic genetic analysis. It is hard to overstate the impact of moving from a sequence-based to a graph-based view of genetics. A traditional single sequence-based reference genome might be missing up to 10% of the genetic sequence in under-represented populations. These missing sequences are the unexplored "dark matter" of the genetic universe. Today, all human DNA variants are identified by comparison with this standard reference genome. Even though the "standard" reference was a massive step forward at the time, it is now known to be biased and to misrepresent whole populations: for example, 10% of African DNA sequence does not align to the current reference genome.

For other populations the gap may be smaller, but the reference approach is still biased and therefore affects genetic studies of under-represented populations and even of supposedly well-represented populations.

This unexplored genomic "dark matter" might include beneficial or harmful traits that traditional techniques are not equipped to study. For example, recent work suggests that a pangenomic approach could have more efficiently discovered a critical genetic variant in the Icelandic population that reduces the risk of heart attacks, and a different critical variant that increases the risk of dementia.

Pangenomics will transform the software stack in computational biology, requiring new graph data structures and new algorithms for graph construction, alignment, annotation, and visualization.
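To make the idea of a graph data structure concrete, here is a minimal sketch of a toy sequence graph in Python. It is an illustration only, not the data model of any tool presented at the meeting: the class name `SequenceGraph`, its methods, and the two example genomes are all hypothetical. Nodes hold short DNA segments, each genome is stored as a path (a walk over nodes), and sequence that no reference walk visits stays in the graph instead of being dropped.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple


@dataclass
class SequenceGraph:
    """Toy pangenome graph (hypothetical): nodes hold DNA segments, genomes are walks."""

    segments: Dict[int, str] = field(default_factory=dict)
    edges: Set[Tuple[int, int]] = field(default_factory=set)
    paths: Dict[str, List[int]] = field(default_factory=dict)

    def add_segment(self, node_id: int, sequence: str) -> None:
        self.segments[node_id] = sequence

    def add_path(self, name: str, walk: List[int]) -> None:
        # Every node on the walk must already exist in the graph.
        for node_id in walk:
            if node_id not in self.segments:
                raise KeyError(f"unknown node {node_id} in path {name!r}")
        self.paths[name] = list(walk)
        # Consecutive nodes on a walk are connected by an edge.
        self.edges.update(zip(walk, walk[1:]))

    def spell(self, name: str) -> str:
        """Reconstruct the linear sequence of one genome from its walk."""
        return "".join(self.segments[n] for n in self.paths[name])

    def bases_missing_from(self, reference: str) -> int:
        """Count bases on nodes that the given reference path never visits."""
        visited = set(self.paths[reference])
        return sum(len(seq) for n, seq in self.segments.items() if n not in visited)


def build_example() -> SequenceGraph:
    """Two made-up genomes that differ by one SNP and one insertion."""
    graph = SequenceGraph()
    graph.add_segment(1, "ACGT")
    graph.add_segment(2, "G")      # allele carried by the reference
    graph.add_segment(3, "T")      # alternative allele (SNP)
    graph.add_segment(4, "TTCA")
    graph.add_segment(5, "GGGCC")  # insertion absent from the reference
    graph.add_segment(6, "AT")
    graph.add_path("reference", [1, 2, 4, 6])
    graph.add_path("sample_a", [1, 3, 4, 5, 6])
    return graph


if __name__ == "__main__":
    g = build_example()
    print(g.spell("reference"))
    print(g.spell("sample_a"))
    print(g.bases_missing_from("reference"))
```

In this toy example the six bases on nodes 3 and 5 are invisible to any analysis that only uses the `reference` path, which is the small-scale analogue of the non-reference sequence discussed above. Real pangenome graphs additionally have to handle strand orientation, very large inputs, and efficient indexing, none of which this sketch attempts.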

We expect pangenome tools to replace the reference-based approaches used in most genomic workflows today. As an example of this work and of the potential impact of computational pangenomics, consider the pangenome graph for chromosome 20 very recently constructed for the HPRC. Computational pangenomics revealed a hairball in the graph that indicates an unexpected amount of variation in the previously inaccessible centromeric region.

Further reading

  1. Eizenga, J. M. and Novak, A. M. and Sibbesen, J. A. and Heumos, S. and Ghaffaari, A. and Hickey, G. and Chang, X. and Seaman, J. D. and Rounthwaite, R. and Ebler, J. and Rautiainen, M. and Garg, S. and Paten, B. and Marschall, T. and Siren, J. and Garrison, E., Pangenome Graphs, Annual Review of Genomics and Human Genetics, 2020.
  2. Sherman, R. M. and Salzberg, S. L., Pan-Genomics in the Human Genome Era, Nature Reviews Genetics, 2020.