Recommendation

TransPI: A balancing act between transcriptome assemblers

Oleg Simakov based on reviews by Gustavo Sanchez and Juan Daniel Montenegro Cabrera

A recommendation of:

TransPi - a comprehensive TRanscriptome ANalysiS PIpeline for de novo transcriptome assembly

Ramon E Rivera-Vicens, Catalina Garcia-Escudero, Nicola Conci, Michael Eitel, Gert Wörheide (2021), bioRxiv, 2021.02.18.431773, ver. 3 peer-reviewed and recommended by Peer Community in Genomics https://doi.org/10.1101/2021.02.18.431773

Abstract

The use of RNA-Seq data and the generation of de novo transcriptome assemblies have been pivotal for studies in ecology and evolution. This is distinctly true for non-model organisms, where no genome information is available. Nevertheless, studies of differential gene expression, DNA enrichment bait design, and phylogenetics can all be accomplished with the data gathered at the transcriptomic level. Multiple tools are available for transcriptome assembly; however, no single tool can provide the best assembly for all datasets. Therefore, a multi-assembler approach, followed by a reduction step, is often sought to generate an improved representation of the assembly. To reduce errors in these complex analyses while at the same time attaining reproducibility and scalability, automated workflows have been essential in the analysis of RNA-Seq data. However, most of these tools are designed for species where genome data is used as a reference for the assembly process, limiting their use in non-model organisms. We present TransPi, a comprehensive pipeline for de novo transcriptome assembly that requires minimal user input without losing the ability to perform a thorough analysis. A combination of different model organisms, k-mer sets, read lengths, and read quantities was used for assessing the tool. Furthermore, a total of 49 non-model organisms, spanning different phyla, were also analyzed. Compared to approaches using single assemblers only, TransPi produces higher BUSCO completeness percentages and a concurrent significant reduction in duplication rates. TransPi is easy to configure and can be deployed seamlessly using Conda, Docker and Singularity.

transcriptome, assembly, pipeline, nextflow

Submission: posted 18 February 2021
Recommendation: posted 27 June 2021, validated 19 July 2021

Cite this recommendation as:
Simakov, O. (2021) TransPI: A balancing act between transcriptome assemblers. Peer Community in Genomics, 100009. https://doi.org/10.24072/pci.genomics.100009

Recommendation

Ever since the introduction of the first widely usable assemblers for transcriptomic reads (Huang and Madan 1999; Schulz et al. 2012; Simpson et al. 2009; Trapnell et al. 2010, and many more), it has been a technical challenge to compare different methods and to choose the “right” or “best” assembly. It took years until the first widely accepted set of benchmarks beyond raw statistical evaluation became available (e.g., Parra, Bradnam, and Korf 2007; Simão et al. 2015). However, an approach to find the right balance between the number of transcripts or isoforms and evolutionary completeness measures has been lacking. This has been particularly pronounced in the field of non-model organisms (i.e., wild species that lack a genomic reference). Often, studies in this area employed only one set of assembly tools (the most often used to this day being Trinity, Haas et al. 2013; Grabherr et al. 2011). While it was relatively straightforward to obtain an initial assembly, its validation and annotation, as well as its application to the particular purpose that the study was designed for (phylogenetics, differential gene expression, etc.), lacked a clear workflow. This led to many studies using a custom set of tools, with varying degrees of reproducibility as a result.

TransPi (Rivera-Vicéns et al. 2021) fills this gap by first employing a meta approach using several available transcriptome assemblers and algorithms to produce a combined and reduced transcriptome assembly, then validating and annotating the resulting transcriptome. Notably, TransPi performs an extensive analysis/detection of chimeric transcripts, the results of which show that this new tool often produces fewer misassemblies than Trinity. TransPi not only generates a final report that includes the most important plots (in clickable/zoomable format) but also stores all relevant intermediate files, allowing advanced users to take a deeper look and/or experiment with different settings. As running TransPi is largely automated (including its installation via several popular package managers), it is very user-friendly and is likely to become the new "gold standard" for transcriptome analyses, especially of non-model organisms.
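
To make the "combine, then reduce" idea concrete, the following is a minimal toy sketch in Python. It is not TransPi's implementation: according to the reviews, the reduction in TransPi is performed with EvidentialGene, which is far more sophisticated. The function name `pool_and_reduce`, the assembler labels and the sequences are all hypothetical, and the redundancy rule used here (drop a contig if it is identical to, or fully contained in, a longer kept contig, ignoring reverse complements) is an assumption chosen only for illustration.

```python
from typing import Dict, List


def pool_and_reduce(assemblies: Dict[str, List[str]]) -> List[str]:
    """Pool contigs from several assemblers and drop redundant ones.

    Toy rule (an assumption, not TransPi's method): a contig is redundant
    if it is identical to, or a substring of, a longer contig already kept.
    Reverse complements and near-identical sequences are ignored.
    """
    # Pool all contigs; the set removes exact duplicates across assemblers.
    pooled = {seq.upper() for contigs in assemblies.values() for seq in contigs}

    kept: List[str] = []
    # Visit longest contigs first so shorter, contained ones can be dropped.
    for seq in sorted(pooled, key=lambda s: (-len(s), s)):
        if not any(seq in longer for longer in kept):
            kept.append(seq)
    return kept


# Hypothetical output of two assemblers run on the same reads.
assemblies = {
    "assembler_a": ["ATGGCCAAGTTGA", "ATGCCC"],
    "assembler_b": ["GGCCAAG", "ATGGCCAAGTTGA", "TTTTGGGA"],
}

reduced = pool_and_reduce(assemblies)
print(reduced)
```

In this toy input, the contig shared by both assemblers is kept once and the fragment contained in it is dropped, while contigs found by only one assembler survive. That is the intuition behind pooling: each assembler may recover transcripts the others miss, and the reduction step keeps the pooled set from inflating duplication.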

References

Grabherr MG, Haas BJ, Yassour M, Levin JZ, Thompson DA, Amit I, Adiconis X, Fan L, Raychowdhury R, Zeng Q, Chen Z, Mauceli E, Hacohen N, Gnirke A, Rhind N, di Palma F, Birren BW, Nusbaum C, Lindblad-Toh K, Friedman N, Regev A (2011) Full-length transcriptome assembly from RNA-Seq data without a reference genome. Nature Biotechnology, 29, 644–652. https://doi.org/10.1038/nbt.1883

Haas BJ, Papanicolaou A, Yassour M, Grabherr M, Blood PD, Bowden J, Couger MB, Eccles D, Li B, Lieber M, MacManes MD, Ott M, Orvis J, Pochet N, Strozzi F, Weeks N, Westerman R, William T, Dewey CN, Henschel R, LeDuc RD, Friedman N, Regev A (2013) De novo transcript sequence reconstruction from RNA-seq using the Trinity platform for reference generation and analysis. Nature Protocols, 8, 1494–1512. https://doi.org/10.1038/nprot.2013.084

Huang X, Madan A (1999) CAP3: A DNA Sequence Assembly Program. Genome Research, 9, 868–877. https://doi.org/10.1101/gr.9.9.868

Parra G, Bradnam K, Korf I (2007) CEGMA: a pipeline to accurately annotate core genes in eukaryotic genomes. Bioinformatics, 23, 1061–1067. https://doi.org/10.1093/bioinformatics/btm071

Rivera-Vicéns RE, Garcia-Escudero CA, Conci N, Eitel M, Wörheide G (2021) TransPi – a comprehensive TRanscriptome ANalysiS PIpeline for de novo transcriptome assembly. bioRxiv, 2021.02.18.431773, ver. 3 peer-reviewed and recommended by Peer Community in Genomics. https://doi.org/10.1101/2021.02.18.431773

Schulz MH, Zerbino DR, Vingron M, Birney E (2012) Oases: robust de novo RNA-seq assembly across the dynamic range of expression levels. Bioinformatics, 28, 1086–1092. https://doi.org/10.1093/bioinformatics/bts094

Simão FA, Waterhouse RM, Ioannidis P, Kriventseva EV, Zdobnov EM (2015) BUSCO: assessing genome assembly and annotation completeness with single-copy orthologs. Bioinformatics, 31, 3210–3212. https://doi.org/10.1093/bioinformatics/btv351

Simpson JT, Wong K, Jackman SD, Schein JE, Jones SJM, Birol İ (2009) ABySS: A parallel assembler for short read sequence data. Genome Research, 19, 1117–1123. https://doi.org/10.1101/gr.089532.108

Trapnell C, Williams BA, Pertea G, Mortazavi A, Kwan G, van Baren MJ, Salzberg SL, Wold BJ, Pachter L (2010) Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nature Biotechnology, 28, 511–515. https://doi.org/10.1038/nbt.1621

Conflict of interest:
The recommender in charge of the evaluation of the article and the reviewers declared that they have no conflict of interest (as defined in the code of conduct of PCI) with the authors or with the content of the article. The authors declared that they comply with the PCI rule of having no financial conflicts of interest in relation to the content of the article.

Reviews

Evaluation round #1

DOI or URL of the preprint: https://doi.org/10.1101/2021.02.18.431773

Version of the preprint: 1

Author's Reply, 16 Jun 2021

Download author's reply https://doi.org/10.24072/pci.genomics.100050.ar1

Decision by Oleg Simakov, posted 23 Mar 2021

The article presents a meta approach to transcriptome assembly, validation, and annotation. Based on multiple available tools, TransPi aims to find the most suitable assembler/algorithm combination for a given set of data, followed by automated annotation. One of the main advantages of this approach is its flexibility in working on data from both "model" and "non-model" organisms and for users with various levels of expertise. TransPi also provides a clear and reproducible workflow. The manuscript is clearly written, and I will be happy to recommend it once the authors address the few points raised by the reviewers (in particular reviewer 2's concerns). These are very detailed and mostly can help phrase the manuscript better. They also include some important additional validation ideas, including an assessment of chimeric transcripts and merged/unmerged isoforms against "gold-standard" data available for some of the species (i.e., validation should not be limited mainly to BUSCO scores).

https://doi.org/10.24072/pci.genomics.100050.d1

Reviewed by Gustavo Sanchez, 11 Mar 2021

Rivera-Vicéns et al. report TransPi, a user-friendly pipeline for de novo transcriptome assembly, useful for non-model organisms. I like the efforts of the authors to compare the performance of TransPi with the popular assembler Trinity and the inclusion of datasets from different taxonomic ranks for their analysis. I am also surprised by the reduction performed in EvidentialGene (sometimes over 50%), perhaps one of the best additional steps implemented in TransPi. The interactive report at the end of the pipeline is also a significant advantage for researchers who want to check their assembly's quality quickly before going on to the next steps of the project.

I have only a few minor comments to improve the reading. Please see the revised version attached.

Download the review https://doi.org/10.24072/pci.genomics.100050.rev11

Reviewed by Juan Daniel Montenegro Cabrera, 16 Mar 2021

Download the review https://doi.org/10.24072/pci.genomics.100050.rev12
