Recommendation

Diving, and even digging, into the wild jungle of annotation pathways for non-vertebrate animals

Francois Sabot based on reviews by Yann Bourgeois, Cécile Monat, Valentina Peona and Benjamin Istace

A recommendation of:

A deep dive into genome assemblies of non-vertebrate animals

Nadège Guiglielmoni, Ramón Rivera-Vicéns, Romain Koszul, Jean-François Flot (2022), Preprints, 2021110170, ver. 3 peer-reviewed and recommended by Peer Community in Genomics https://doi.org/10.20944/preprints202111.0170.v3

Now published in Peer Community Journal.

Abstract

A deep dive into genome assemblies of non-vertebrate animals

Non-vertebrate species represent about ∼95% of known metazoan (animal) diversity. They remain to this day relatively unexplored genetically, but understanding their genome structure and function is pivotal for expanding our current knowledge of evolution, ecology and biodiversity. Following the continuous improvements and decreasing costs of sequencing technologies, many genome assembly tools have been released, leading to a significant amount of genome projects being completed in recent years. In this review, we examine the current state of genome projects of non-vertebrate animal species. We present an overview of available sequencing technologies, assembly approaches, as well as pre and post-processing steps, genome assembly evaluation methods, and their application to non-vertebrate animal genomes.

genome assembly, sequencing, non-vertebrate animals

Submission: posted 10 November 2021
Recommendation: posted 21 April 2022, validated 06 May 2022

Cite this recommendation as:
Sabot, F. (2022) Diving, and even digging, into the wild jungle of annotation pathways for non-vertebrate animals. Peer Community in Genomics, 100016. https://doi.org/10.24072/pci.genomics.100016

Recommendation

In their paper, Guiglielmoni et al. invite us to pick up our snorkels and fins and take "A deep dive into genome assemblies of non-vertebrate animals" (1). Indeed, while numerous assembly-related tools have been developed and tested on human genomes (or at least on vertebrates such as mice), very few have been tested on non-vertebrate animals so far. Moreover, most benchmarks target raw assembly tools, and very few offer a guide from raw reads to an almost finished assembly, including quality control and phasing.

This huge and exhaustive review starts with an overview of the current sequencing technologies, followed by the theory of the different approaches for assembly and their implementation. For each approach, the authors present some of the most representative tools, as well as the limits of the approach.

The authors additionally present all the steps required to obtain an almost complete chromosome-scale assembly, with all the different technologies currently available for scaffolding, QC, and phasing, and the way these tools can be applied to non-vertebrate animals. Finally, they offer some useful advice on the choice of the different approaches (but not always tools, see below), and advocate for a robust genome database with all information on the way the assembly was obtained.

This review is very complete for now and is a very good starting point for any student or scientist interested in starting to work on genome assembly, for either model or non-model organisms. However, the authors do not provide a list of tools or a benchmark of them as a recommendation. Why? Because such a proposal may be obsolete in less than a year. Indeed, with the explosion of third-generation sequencing technologies, assembly tools (for the different steps) are constantly evolving, and their performance improves on a monthly basis. In addition, some tools are really efficient at the time of a review or of an article, but are not developed further later on, and thus will not evolve with the technology. We have all seen it with wonderful tools such as Chiron (2) or TopHat (3), which were very promising but could not be developed further because the project stopped, the contract of the post-doc in charge of the development ended, or the developer decided to switch to another paradigm. Such advice would, therefore, need to be constantly updated.

Thus, the manuscript from Guiglielmoni et al. will be an almost timeless one (up to the next sequencing revolution at least), and since the authors advocate for a more informed genome database, I think we should consider a rolling benchmarking system (tools, genome and sequence dataset) that would keep the performance of the tools up to date and propose the best set of assembly tools for a given type of genome.

References

1. Guiglielmoni N, Rivera-Vicéns R, Koszul R, Flot J-F (2022) A Deep Dive into Genome Assemblies of Non-vertebrate Animals. Preprints, 2021110170, ver. 3 peer-reviewed and recommended by Peer Community in Genomics. https://doi.org/10.20944/preprints202111.0170

2. Teng H, Cao MD, Hall MB, Duarte T, Wang S, Coin LJM (2018) Chiron: translating nanopore raw signal directly into nucleotide sequence using deep learning. GigaScience, 7, giy037. https://doi.org/10.1093/gigascience/giy037

3. Trapnell C, Pachter L, Salzberg SL (2009) TopHat: discovering splice junctions with RNA-Seq. Bioinformatics, 25, 1105–1111. https://doi.org/10.1093/bioinformatics/btp120

Conflict of interest:
The recommender in charge of the evaluation of the article and the reviewers declared that they have no conflict of interest (as defined in the code of conduct of PCI) with the authors or with the content of the article. The authors declared that they comply with the PCI rule of having no financial conflicts of interest in relation to the content of the article.

Reviews

Reviewed by Cécile Monat, 23 Mar 2022

This new version of the manuscript is enriched with corrections and clarifications that answer my questions and suggestions and those of the other reviewers, making it a better article.

https://doi.org/10.24072/pci.genomics.100016.rev21

Reviewed by Yann Bourgeois, 11 Apr 2022

The authors have provided an answer to all my main comments, and I mostly agree with them. I still believe that a rough estimate of the current (and past) costs for different techniques could be provided, acknowledging that this may become quickly obsolete. This would provide an upper range estimate for teams intending to start a genome sequencing project. The review is already thorough and I am happy to support acceptance.
Best wishes,

Yann Bourgeois

https://doi.org/10.24072/pci.genomics.100016.rev22

Reviewed by Benjamin Istace, 15 Mar 2022

The authors have successfully addressed all my concerns.

https://doi.org/10.24072/pci.genomics.100016.rev23

Evaluation round #1

DOI or URL of the preprint: 10.20944/preprints202111.0170.v1

Version of the preprint: 1

Author's Reply, 07 Mar 2022

Download author's reply https://doi.org/10.24072/pci.genomics.100155.ar1

Decision by Francois Sabot, posted 06 Jan 2022

Dear Dr Guiglielmoni,

I have been through your manuscript, as have 4 independent reviewers, and we all agree that the manuscript is of high interest.

They all, however, raised minor comments to be addressed before acceptance of the manuscript, and I encourage you to address them quickly so that I can accept it.

In addition, Dr Bourgeois discussed at length different aspects of the manuscript that, in my opinion, are of great interest. Indeed, proposing specific tools for each step would be of great help for non-specialists and beginners...

However, based on my own experience, such recommendations, while of high quality at the time of publication and on some specific genomes, would quickly become outdated and could mislead readers.

Thus, these comments, while very interesting, would in my view be better suited to an online list that can be quickly updated. I would therefore propose that you discuss them in the manuscript in this way.

Sincerely yours,

Dr Francois Sabot

https://doi.org/10.24072/pci.genomics.100155.d1

Reviewed by Cécile Monat, 29 Nov 2021

First, I would like to thank the authors for the work they have done. Here they present a review paper about sequencing non-vertebrate genomes. As a whole, this paper is very pleasant to read.

Each part is rich in details on the history of technologies and methods. The presentation of tools is quite exhaustive. These two points make this paper an excellent starting point for people unfamiliar with sequencing technologies, and more particularly with sequencing non-vertebrate genomes.

In Figure 2, I would recommend using some color to make the message easier to understand, and using a monospaced font for the consensus part.
The central part of Figure 5 might be improved, perhaps with clear arrow directions and a starting point.

https://doi.org/10.24072/pci.genomics.100155.rev11

Reviewed by Valentina Peona, 17 Dec 2021

Download the review https://doi.org/10.24072/pci.genomics.100155.rev12

Reviewed by Benjamin Istace, 26 Nov 2021

I read the manuscript titled «A deep dive into genome assemblies of non-vertebrate animals» by Guiglielmoni et al. with great interest. The authors talk about existing methods and algorithms for constructing contiguous and accurate genome assemblies in the context of metazoan genomes. In my opinion, the article is well written and easily understandable by non-specialists. I only have minor concerns that I would like the authors to address if they agree with me.

## Introduction
7;894 => 7,894

## Sequencing
Figure 1: I understand the intent of this figure, but I find it pretty challenging to read, and points hide other points. One way of fixing this would be to aggregate the data of each category per year and turn it into a boxplot.
«The resulting reads have a length around twenty kilobases (kb)»: In my experience, PacBio reads usually have a mean size around 15kb that can go up to 25kb (see https://www.nature.com/articles/s41597-020-00743-4 as an example).
«The error rate has also been decreasing with the release of new flow cells and the development of more accurate basecallers such as Bonito.» There is also a new protocol called Q20+, which makes it possible to generate reads with a 1% error rate.

## Genome assembly
«DBG-based assemblers require highly accurate reads in which errors are only substitutions, with no indels»: why should there be no indels?
«To this end, heterozygous regions are collapsed in order to keep a single sequence for every region in the genome»: this is true if the genome is not very heterozygous. In the other scenario, both haplotypes can often be retrieved, as heterozygous regions are pretty different.

## Assembly pre and post-processing
Table 2 - Long reads error correction: NaS is missing. https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-015-1519-z
Table 2 - Short and long reads polishing: a new tool called HAPO-G has been published recently and is absent from the list. It has been developed explicitly to polish heterozygous genomes but also handles homozygous genomes. https://academic.oup.com/nargab/article/3/2/lqab034/6262629
Figure 6: Same as Figure 1
Drawbacks of using Hi-C are not presented. As an example, the fact that gap sizes cannot be estimated is not indicated.
«Assembly and pre/post-processing steps are often combined in one tool» makes it look like there is no need to post-process assemblies further, but if the polishing step is only done with long reads, the final quality will not be great.

## Phasing assemblies
Hifiasm is another assembler that can phase haplotypes.

Download the review https://doi.org/10.24072/pci.genomics.100155.rev13

Reviewed by Yann Bourgeois, 17 Dec 2021

This work reviews the current state of methods for genome sequencing and de novo assembly, with a particular focus on invertebrates, for which resources are still missing. This sort of work should be encouraged, as it aims at expanding genomic resources to non-model species, which is crucial to obtain a more comprehensive picture of the evolutionary and mechanistic processes underlying biological diversity. The “technical” content is comprehensive and mostly up-to-date. My main concerns mostly revolve around the structure and the scope of the review. In its current state, it reads like a rather “generic” review about assembly tools, with illustrations drawn from genomic studies of invertebrate species. I think that the review would benefit from a more explicit description of the specific challenges encountered in invertebrates. Low DNA amounts are mentioned, but there are other aspects that could be described. For example, many species are difficult to raise in controlled conditions, or rare in the wild, or poorly described from a taxonomic perspective. On the other hand, many species of arthropods reproduce asexually (e.g. Daphnia), which may help increase the yield of DNA from the same genotype. At the moment, it reads more like a collection of anecdotes (which I agree all reveal an interesting problem): there may be a better way to structure it.

It would also be good to explain from the beginning the readership that this review targets. For example, I understand the interest of adopting a historical perspective in the first section (Sequencing) if the review is a resource for new practitioners. However, a review that aims at explaining the current methods for genome assembly to "naive" readers should take more time explaining basic concepts (e.g. N50). A glossary could be useful. On the other hand, if the review is addressed to scientists who already have some experience with the techniques and the terms, the somewhat long description of Sanger sequencing may not be particularly useful.
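
As an illustration of the kind of basic concept mentioned above, N50 can be shown in a few lines. The sketch below is not taken from the manuscript or from this review; the function name `n50` and the contig lengths are hypothetical. It uses the usual definition: the length of the shortest contig among the longest contigs that together cover at least half of the total assembly length.

```python
def n50(contig_lengths):
    """Return the N50 of a collection of contig lengths.

    Contigs are sorted from longest to shortest and accumulated until
    at least half of the total assembly length is covered; the length
    of the contig that crosses that threshold is the N50.
    """
    if not contig_lengths:
        raise ValueError("at least one contig length is required")
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        # Compare 2 * running to total to avoid fractional halves.
        if 2 * running >= total:
            return length


# Hypothetical toy assembly (lengths in arbitrary units, total = 300).
example_lengths = [80, 70, 50, 40, 30, 20, 10]
print(n50(example_lengths))
```

Here the two longest contigs (80 + 70 = 150) already cover half of the 300 units, so the N50 of this toy assembly is 70.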

In my opinion the review does not provide (yet) a guide to decide on a sequencing strategy. The information is already there, but could be highlighted in a more organized way. Figure 6 is a good example of what could be done more extensively throughout the review in my opinion (with more details).

The authors could compare the quality of currently available assemblies, using several metrics, and highlight the methods used to obtain them. For example, what sequencing depth of coverage is needed when using only Illumina reads + mate pairs? Hi-C? PacBio + Illumina short reads? What is the average cost? It would be useful to have figures such as decision-making flowcharts. Figure 5 could be expanded to highlight the different possible options at each step (short reads? long reads? what is the best option given a budget of $10,000? $50,000?). What are the bioinformatic resources needed? What is the runtime of different programs, and how does this runtime scale with genome size and complexity?

I also think that mentioning reference-guided assemblies could be useful, especially for readers who consider working on a species related to one that has already been sequenced. If there are reasons to assume that synteny is high and divergence low, reference-guided assemblies may be a good way for researchers with limited financial resources to obtain a valuable resource. A particularly interesting paper from this perspective (in my opinion) is the following one (Lischer & Shimizu, BMC Bioinformatics, 2017):

https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-017-1911-6

Note that this paper also proposes an interesting way to test for the quality of assemblies obtained by different methods through the combination of 36 summary statistics (using z-scores for each of the statistics and comparing their distributions across methods).
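
The general idea of combining summary statistics through z-scores can be sketched as follows. This is a simplified illustration only, not the procedure of Lischer & Shimizu: the three statistics, the assembly names, the values and the function `combined_z_scores` are all hypothetical, and a plain mean of z-scores is assumed as the summary.

```python
from statistics import mean, pstdev


def combined_z_scores(assemblies, higher_is_better):
    """Average per-statistic z-scores for each assembly.

    assemblies: {assembly_name: {statistic_name: value}}
    higher_is_better: {statistic_name: bool}, giving the direction in
    which each statistic indicates a better assembly.
    """
    names = list(assemblies)
    per_assembly = {name: [] for name in names}
    for stat, better_high in higher_is_better.items():
        values = [assemblies[name][stat] for name in names]
        mu, sigma = mean(values), pstdev(values)
        for name, value in zip(names, values):
            # A statistic with no variation carries no information.
            z = 0.0 if sigma == 0 else (value - mu) / sigma
            # Flip the sign so that a larger z always means "better".
            per_assembly[name].append(z if better_high else -z)
    return {name: mean(zs) for name, zs in per_assembly.items()}


# Hypothetical values for three assemblies of the same genome.
toy_assemblies = {
    "method_A": {"n50": 100, "n_contigs": 50, "busco_complete": 95},
    "method_B": {"n50": 50, "n_contigs": 100, "busco_complete": 90},
    "method_C": {"n50": 10, "n_contigs": 500, "busco_complete": 70},
}
directions = {"n50": True, "n_contigs": False, "busco_complete": True}

scores = combined_z_scores(toy_assemblies, directions)
for name, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {score:+.2f}")
```

Standardizing each statistic before averaging puts quantities with very different units (contig counts, lengths, percentages) on the same scale, so no single statistic dominates the ranking simply because of its magnitude.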

Finally, it may be worth explaining what can be done with a genome assembly depending on its quality. If the goal is to run preliminary population genetics analyses, a fragmented assembly can already be very useful. For comparative genomic analyses, assessment of repetitive content (transposable elements), or functional studies, high-quality assemblies are the target to reach.

Nevertheless, I want to emphasize the fact that the review is rather comprehensive, and mostly needs polishing to increase its impact on a broad range of readers.

Minor comments through the text:

Introduction, Paragraph 6: The bit about BUSCO feels slightly too long, although the issue highlighted is very interesting. There are many other possible biases that could be discussed. Maybe shorten it, and provide other examples of how bias towards model systems can impair research on non-vertebrates. In general, the Introduction would benefit from explicitly stating the scope of the review, and what it aims to achieve (decision-making tool? Comparison of methods? Introduction to the field for new practitioners?).

Sequencing, second paragraph. N50 is usually low for second generation sequencing, as you mention, but using Hi-C, Hi-Fi or mate pairs (which I would still classify as second-generation sequencing) can improve assemblies a lot.

Sequencing, third paragraph. The current increase in accuracy for base calling and assembly from nanopore reads is encouraging, but should be discussed more in terms of minimum depth of coverage required, the quality of training datasets (for algorithms using machine/deep learning), etc. Note the existence of another base-caller, Poreover, to be used in combination with Bonito https://github.com/jordisr/poreover

Table 1: This table is a good resource, but it may be worth considering merging it with table 2. A classification highlighting speed and memory requirements would be useful. As mentioned in the main comments, I am not sure that the row on first-generation sequencing is particularly useful.

P8: You talk here about k-mers, but what about decisions on which k-mer length to use? Why is it important to use several k-mer lengths when assembling? This is something that you could already explain here.
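
To make the question of k-mer length concrete, here is a minimal sketch; it is not from the manuscript, and the toy sequence and the helper names `kmer_counts` and `repeated_kmers` are hypothetical. It counts how many distinct k-mers occur more than once for several values of k. Short k-mers tend to be shared between unrelated positions, whereas longer k-mers resolve more repeats but require longer error-free stretches of sequence, which is one reason why several lengths are often tried.

```python
from collections import Counter


def kmer_counts(sequence, k):
    """Count every overlapping k-mer of length k in the sequence."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def repeated_kmers(sequence, k):
    """Number of distinct k-mers that occur more than once."""
    return sum(1 for count in kmer_counts(sequence, k).values() if count > 1)


# Hypothetical toy sequence containing a short repeated motif.
toy_sequence = "ACGTACGTTTGACGTACC"
for k in (3, 5, 7, 9):
    distinct = len(kmer_counts(toy_sequence, k))
    print(f"k={k}: {distinct} distinct k-mers, "
          f"{repeated_kmers(toy_sequence, k)} seen more than once")
```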

P17: Assembly evaluation. There are so many ways to estimate the quality of an assembly that some authors have proposed a set of tens of summary statistics, which they summarize as a z-score (see papers on reference-guided assemblies).

Figure 4: It would be interesting from a decision-making perspective to add a panel with the different techniques used to assemble these genomes.

https://doi.org/10.24072/pci.genomics.100155.rev14
