Pangenomics Bio Hacking 2021

About

We are excited to organize, on December 9-10, the first worldwide conference for researchers who focus on free and open source software development for pangenomics.

The meeting will be fully online, with morning sessions (Central European time zone) dedicated to hacking, virtually taking place at the University of Milano - Bicocca, Milano, Italy. Talks are either early morning (Asia time zones) or evening (USA time zones).

Participation is free, but registration is required.

Videos of all talks are available on YouTube.

Location

Virtually in Milano, Italy

Videos

Day 1: YouTube Stream

Day 2: YouTube Stream

Chats

general

question to speakers

Speakers

9h30 Advanced high-resolution approaches: human health and biodiversity genomics

Shilpa Garg home page

University of Copenhagen

15h00 Representing 661,405 bacterial genomes using minimizer-space de Bruijn graphs

Rayan Chikhi homepage

Institut Pasteur and CNRS

15h30 Establishing Bovine Pangenome Graphs

Danang Crysnanto website

ETH Zurich

16h00 Giraffe: A Pangenomic Short Read Aligner

Jouni Siren home page

UC Santa Cruz

16h30 gSV - a reference free SV caller

Christian Kubica GitHub

Max Planck Institute for Developmental Biology, Tuebingen, Germany

9h30 Fast and memory-efficient partial order alignment with abPOA

Yan Gao GitHub

Children's Hospital of Philadelphia and Sun Yat-sen University

16h00 PanGenie - Pangenome-based inference

Jana Ebler website

Heinrich Heine University Düsseldorf

16h30 A pangenome for the expanded BXD family of mice

Flavia Villani website

University of Tennessee

17h00 MONI: Design Challenges for a Pangenomic Tool

Massimiliano Rossi home page

University of Florida

Schedule

All times are Central European Time (CET).

Time	Slot	Description
Dec 09 - 9h00	Check-in
9h30	Shilpa Garg Advanced high-resolution approaches: human health and biodiversity genomics	Chromosome-scale haplotypes are important for studying genetic variation associated with diseases and biodiversity evolution. Advancements in third-generation sequencing have opened enormous opportunities to reconstruct genomes at single-base resolution. In this talk, I will present sequencing methods and graph-based approaches to combine the data types in an efficient manner. Further, I will provide examples relevant in clinical settings and demonstrate these approaches in comparative genomics.
10h00	Hackathon	Hack Goal
12h00	Lunch
13h00	Hackathon	Hack Goal
15h00	Rayan Chikhi Representing 661,405 bacterial genomes using minimizer-space de Bruijn graphs	DNA sequencing continues to progress toward longer and more accurate reads. Yet, primary analyses, such as genome assembly and pangenome graph construction, remain challenging and energy-inefficient. We recently introduced the concept of minimizer-space sequencing analysis, expanding the alphabet of DNA sequences to atomic tokens made of fixed-length words. This leads to orders-of-magnitude improvements in speed and memory usage for human genome assembly and metagenome assembly and enables for the first time a representation of a pangenome made of 661,405 bacterial genomes.
15h30	Danang Crysnanto Establishing Bovine Pangenome Graphs	The reference genome is generated from a single or a few individuals, and thus it is a poor representation of the full species diversity. This pitfall has led to the development of the pangenome: the use of multiple genomes for genomic analysis. In this study, we integrated six cattle reference-quality genome assemblies into a pangenome graph and found 70 Mb of sequence that is not included in the existing cattle reference genome. We further demonstrated that these non-reference sequences contain functionally active bases and thousands of polymorphic sites that remain undetected with a single linear genome. Our findings call for a more representative reference genome that captures the entire species diversity.
16h00	Jouni Siren Giraffe: A Pangenomic Short Read Aligner	Pangenomic aligners try to avoid reference bias in read mapping by including common variants in the reference. They achieve a higher mapping accuracy than traditional linear aligners, at the expense of being an order of magnitude slower. We present Giraffe, a new short read aligner that avoids this trade-off with a number of algorithmic and data model improvements. By combining the speed of linear aligners with the accuracy of pangenomic aligners, Giraffe is a viable choice for large-scale sequencing projects.
16h30	Christian Kubica gSV - a reference free SV caller	While reference guided variant detection has helped us to explore a large fraction of, mostly non-complex, variation, its properties derived from pairwise comparisons have hindered true pan-genomic analysis. Until now, variant calls have always been reference biased and dependent on sequence being available in the singular reference genome. The coordinate system used has been reference based and thus the sequence context in the query sequence has been lost. Furthermore, nested variation has been hard to access from such variant calls. By using the information stored in whole genome alignment derived genome graphs we are now able to explore variation in a true pan-genomic environment. Here we present gSV, a first solution to call complex and nested variation as it is represented in a genome graph. gSV explores the graph inherent bubble structures from a fully pan-genomic point of view, thus removing the need for a reference coordinate system. This enables us to access nested variation and assign hierarchical parent-child relationships to bubbles while preserving the sequence context of each variant in every genome.
17h30	End of Day 1
Dec 10 - 9h00	Beginning of Day 2
9h30	Yan Gao Fast and memory-efficient partial order alignment with abPOA	We present abPOA (adaptive banded Partial Order Alignment), a Single Instruction Multiple Data (SIMD)-based C library for fast and memory-efficient partial order alignment. abPOA uses a minimizer-based seeding and partition approach to split the sequence and graph into small windows and performs partial order alignment separately within each window.
10h00	Hackathon	Hack Goal
12h00	Lunch
13h00	Hackathon	Hack Goal
16h00	Jana Ebler PanGenie - Pangenome-based inference	Typical analysis workflows map reads to a reference genome in order to genotype genetic variants. Generating such alignments introduces reference biases and comes with substantial computational burden. In contrast, recent k-mer based genotypers are fast, but struggle in repetitive or duplicated genomic regions. In this talk, I propose a novel algorithm, PanGenie, that leverages a haplotype-resolved pangenome reference in conjunction with k-mer counts from short-read sequencing data to genotype a wide spectrum of genetic variation - a process we refer to as genome inference. Compared to mapping-based approaches, PanGenie is more than 4x faster at 30x coverage and reaches significantly better genotype concordances for almost all variant types and coverages tested. Improvements are especially pronounced for large insertions (>=50bp) and variants in repetitive regions, enabling the inclusion of these classes of variants in genome-wide association studies. PanGenie efficiently leverages the increasing amount of haplotype-resolved assemblies to unravel the functional impact of previously inaccessible variants while being scalable to thousands of genotyped samples.
16h30	Flavia Villani A pangenome for the expanded BXD family of mice	The members of the BXD family have been inbred for 20-200 generations. They are of great value for mapping complex traits and for phenome-wide association analysis. Current genomic studies on BXD assume a single linear reference genome, making it difficult to observe sequences diverging from the reference and therefore limiting the accuracy and completeness of analyses. We sequenced all extant members of all BXD families using linked-read libraries and built the pangenome graph to study genetic variation. We determined that linked reads are not ideal for pangenome building; nevertheless, the pangenome enhanced the calling of complex variants not seen by traditional genomics methods and provided calls with good precision and sensitivity. As a case study, we followed up a strain-specific 2 kb insertion that was inherited by half of the recombinant mice and correlated the genotype at this locus with clinically relevant phenotypes present in the GeneNetwork database.
17h00	Massimiliano Rossi MONI: Design Challenges for a Pangenomic Tool	MONI is a pangenomic index for finding maximal exact matches (MEMs) that is built on top of the r-index but also includes auxiliary data structures to allow finding approximate matches on a pangenomic scale. It includes many open-source tools and data structures such as BigBWT, BigRePair, and SDSL. I will talk about MONI, which is written in C++ and freely available at https://github.com/maxrossi91/moni
17h30	Discussion and Greetings

Organizers

Why pangenomics?

The Human Genome Project's assembly of the first genome took 13 years and cost $2.7 billion. Today, consumers can sequence their whole genomes for as little as $300. Yet most biomedical researchers continue to use a single linear reference genome to assemble larger genome sequences, detect gene variation, and determine gene function.

Who defines this reference genome?

How is it possible to capture the entire diversity of a species with an arbitrary single linear sequence?

In contrast, a pangenome is a collection of genomes organized into a graph data structure to be used for holistic genetic analysis. It is hard to overstate the impact of moving from a sequence-based to a graph-based view of genetics. A traditional single sequence-based reference genome might be missing up to 10% of the genetic sequence in under-represented populations; these missing sequences are the unexplored 'dark matter' of the genetic universe. Today, all human DNA variants are identified by comparison with this standard reference genome. Even though the 'standard' reference was a massive step forward at the time, it is now known to be biased and to misrepresent whole populations: for example, 10% of African DNA sequence does not align to the current reference genome.

For other populations the gap may be smaller, but the reference approach is still biased and therefore affects genetic studies of underrepresented populations and even of supposedly well-represented ones.

This unexplored genomic 'dark matter' might include beneficial or harmful traits that traditional techniques are not equipped to study. For example, recent work suggests that a pangenomic approach could have more efficiently discovered a critical genetic variant in the Icelandic population that reduces the risk of heart attacks, and a different critical variant that increases the risk of dementia.

Pangenomics will transform the software stack in computational biology, requiring new graph data structures and new algorithms for graph construction, alignment, annotation, and visualization.
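As a rough illustration of what a "graph data structure" can mean here, the following Python toy models a pangenome as sequence segments, links between them, and one walk per genome, loosely in the spirit of sequence-graph formats such as GFA. The class name `PangenomeGraph`, its methods, and the example sequences are invented for this sketch. Real tools use far more compact representations that also handle both DNA strands.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class PangenomeGraph:
    """Toy sequence graph (hypothetical): nodes hold DNA segments, paths are genomes."""

    segments: dict[int, str] = field(default_factory=dict)     # node id -> sequence
    edges: set[tuple[int, int]] = field(default_factory=set)   # directed links
    paths: dict[str, list[int]] = field(default_factory=dict)  # genome name -> walk

    def add_path(self, name: str, walk: list[int]) -> None:
        """Register a genome as a walk and record the edges it traverses."""
        for node in walk:
            if node not in self.segments:
                raise KeyError(f"unknown segment {node}")
        self.edges.update(zip(walk, walk[1:]))
        self.paths[name] = walk

    def sequence(self, name: str) -> str:
        """Spell out the linear sequence of one genome."""
        return "".join(self.segments[n] for n in self.paths[name])

    def private_segments(self, reference: str) -> set[int]:
        """Segments that the chosen reference genome never visits."""
        return set(self.segments) - set(self.paths[reference])


# Two genomes that share flanking sequence but differ in the middle segment.
graph = PangenomeGraph(segments={1: "ACGT", 2: "G", 3: "TTA", 4: "CCAT"})
graph.add_path("reference", [1, 2, 4])
graph.add_path("sample_a", [1, 3, 4])

print(graph.sequence("reference"))          # ACGTGCCAT
print(graph.sequence("sample_a"))           # ACGTTTACCAT
print(graph.private_segments("reference"))  # {3}
```

Segment 3 is the kind of sequence a single linear reference cannot show: it exists in `sample_a` but has no coordinate on the reference path.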

We expect pangenome tools to replace the reference-based approaches used in most genomic workflows today. As an example of this work and the potential impact of computational pangenomics, consider the pangenome graph for chromosome 20 very recently constructed for the Human Pangenome Reference Consortium (HPRC). It revealed a hairball in the graph that indicates an unexpected amount of variation in the previously inaccessible centromeric region.

December 9-10, 2021, Online, Virtually in Milano