Recommendation

Toward a critical assessment of virus detection in plants

Hadi Quesneville based on reviews by Alexander Suh and 1 anonymous reviewer

A recommendation of:

Semi-artificial datasets as a resource for validation of bioinformatics pipelines for plant virus detection

Lucie Tamisier, Annelies Haegeman, Yoika Foucart, Nicolas Fouillien, Maher Al Rwahnih, Nihal Buzkan, Thierry Candresse, Michela Chiumenti, Kris De Jonghe, Marie Lefebvre, Paolo Margaria, Jean Sébastien Reynard, Kristian Stevens, Denis Kutnjak, Sébastien Massart (2021), Zenodo, 4584718, ver. 4 peer-reviewed and recommended by Peer Community in Genomics https://doi.org/10.5281/zenodo.4584718

Read preprint in preprint server Now published in Peer Community Journal

Abstract

Semi-artificial datasets as a resource for validation of bioinformatics pipelines for plant virus detection

The widespread use of High-Throughput Sequencing (HTS) for detection of plant viruses and sequencing of plant virus genomes has led to the generation of large amounts of data and of bioinformatics challenges to process them. Many bioinformatics pipelines for virus detection are available, making the choice of a suitable one difficult. A robust benchmarking is needed for the unbiased comparison of the pipelines, but there is currently a lack of reference datasets that could be used for this purpose. We present 7 semi-artificial datasets composed of real RNA-seq datasets from virus-infected plants spiked with artificial virus reads. Each dataset addresses challenges that could prevent virus detection. We also present 3 real datasets showing a challenging virus composition as well as 8 completely artificial datasets to test haplotype reconstruction software.

High-Throughput Sequencing, Reference data, Semi-artificial dataset, Plant virus detection, Bioinformatics pipelines, Haplotype reconstruction

Submission: posted 27 November 2020
Recommendation: posted 02 April 2021, validated 02 April 2021

Cite this recommendation as:
Quesneville, H. (2021) Toward a critical assessment of virus detection in plants. Peer Community in Genomics, 100007. https://doi.org/10.24072/pci.genomics.100007

Recommendation

Over the last decade, High-Throughput Sequencing (HTS) has revealed a previously unsuspected diversity of viruses, as well as their sometimes unexpected presence in healthy individuals. These results demonstrate that genomics offers a powerful tool for studying viruses at the individual level, allowing an in-depth inventory of those infecting an organism. Such approaches make it possible to study viromes at an unprecedented level of detail, both qualitative and quantitative, which opens new avenues for the analysis of viruses of humans, animals and plants. Consequently, the diagnostic field is making increasing use of HTS, fueling the need for efficient and reliable bioinformatics tools.

Many such tools have already been developed, but in plant disease diagnostics, validation of the bioinformatics pipelines used to detect viruses in HTS datasets is still in its infancy. There is an urgent need to benchmark the different tools and algorithms using well-designed reference datasets generated for this purpose. This is a crucial step toward improving existing solutions and establishing well-standardized bioinformatics protocols. This context has led to the creation of the Plant Health Bioinformatics Network (PHBN), a Euphresco network project aiming to build a bioinformatics community working on plant health. One of its objectives is to provide researchers with open-access reference datasets that allow virus detection pipelines to be compared and validated.

In this framework, Tamisier et al. [1] present real, semi-artificial, and completely artificial datasets, each aimed at addressing challenges that could affect virus detection. These datasets comprise real RNA-seq reads from virus-infected plants as well as simulated virus reads. Such work, which provides open-access datasets for benchmarking bioinformatics tools, should be encouraged, as these datasets are key to software improvement. This is demonstrated by the well-known success story of the protein structure prediction community: its pioneering community-wide effort, the Critical Assessment of protein Structure Prediction (CASP) [2], has provided research groups since 1994 with an invaluable way to objectively test their structure prediction methods, thereby delivering an independent assessment of state-of-the-art protein-structure modelling tools. Following this success, many other bioinformatics communities developed similar “competitions”, such as RNA-puzzles [3] to predict RNA structures, Critical Assessment of Function Annotation [4] to predict gene functions, Critical Assessment of Prediction of Interactions [5] to predict protein-protein interactions, and Assemblathon [6] for genome assembly. These are just a few examples from a long list of successful initiatives. Such efforts enable rigorous assessment of tools and stimulate developers’ creativity, and they also provide user communities with a state-of-the-art evaluation of available tools.
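To make the notion of a semi-artificial dataset concrete, the following is a minimal sketch of spiking simulated virus reads into a set of real reads. It is not the authors' procedure: the toy simulator draws error-free reads uniformly from a random sequence, and every name in it (`simulate_reads`, `spike`, the read identifiers, the toy genome) is a hypothetical placeholder chosen for illustration.

```python
import random

def simulate_reads(genome, read_len, n_reads, rng):
    """Draw error-free reads uniformly from a genome string (toy simulator)."""
    reads = []
    for i in range(n_reads):
        start = rng.randrange(0, len(genome) - read_len + 1)
        reads.append((f"sim_virus_{i}", genome[start:start + read_len]))
    return reads

def spike(real_reads, artificial_reads, rng):
    """Mix artificial reads into real ones and shuffle, so that the
    origin of a read cannot be guessed from its position in the file."""
    mixed = list(real_reads) + list(artificial_reads)
    rng.shuffle(mixed)
    return mixed

rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Hypothetical stand-ins: a random "virus genome" and random "real" reads.
toy_genome = "".join(rng.choice("ACGT") for _ in range(500))
real = [(f"real_{i}", "".join(rng.choice("ACGT") for _ in range(50)))
        for i in range(20)]

artificial = simulate_reads(toy_genome, read_len=50, n_reads=5, rng=rng)
dataset = spike(real, artificial, rng)
```

Because the artificial reads keep a recognizable identifier, the composition of the spiked part of the dataset is known exactly, which is what makes such a dataset usable as a reference.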

Inspired by these success stories, the authors propose a “VIROMOCK challenge” [7], asking researchers in the field to test their tools and to provide feedback on each dataset through a repository. This initiative, if widely taken up, will undoubtedly improve the field of virus detection in plants, and probably in many other organisms as well. It will be a major contribution to virology, leading to better diagnostics and, consequently, a better understanding of viral diseases, thereby helping to promote human, animal and plant health.
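As an illustration of the kind of feedback a participant could compute for one dataset, here is a small sketch that compares the viruses a pipeline reports with the known composition of a reference dataset. The scoring scheme (sensitivity and precision over virus names) and the placeholder names `virus_A` to `virus_D` are assumptions made for this example; the challenge repository may ask for different information.

```python
def score_detection(expected, reported):
    """Compare reported virus names with the known dataset composition."""
    expected, reported = set(expected), set(reported)
    true_pos = expected & reported
    false_neg = expected - reported   # viruses present but missed
    false_pos = reported - expected   # viruses reported but absent
    sensitivity = len(true_pos) / len(expected) if expected else float("nan")
    precision = len(true_pos) / len(reported) if reported else float("nan")
    return {
        "missed": sorted(false_neg),
        "spurious": sorted(false_pos),
        "sensitivity": sensitivity,
        "precision": precision,
    }

# Hypothetical example: four viruses in the dataset, three names reported.
result = score_detection(
    expected=["virus_A", "virus_B", "virus_C", "virus_D"],
    reported=["virus_A", "virus_B", "virus_X"],
)
```

Reporting the missed and spurious names alongside the two ratios keeps the result interpretable per dataset, since each dataset targets a specific detection challenge.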

References

[1] Tamisier, L., Haegeman, A., Foucart, Y., Fouillien, N., Al Rwahnih, M., Buzkan, N., Candresse, T., Chiumenti, M., De Jonghe, K., Lefebvre, M., Margaria, P., Reynard, J.-S., Stevens, K., Kutnjak, D. and Massart, S. (2021) Semi-artificial datasets as a resource for validation of bioinformatics pipelines for plant virus detection. Zenodo, 4273791, version 4 peer-reviewed and recommended by Peer Community in Genomics. https://doi.org/10.5281/zenodo.4273791

[2] Critical Assessment of protein Structure Prediction (CASP) - https://en.wikipedia.org/wiki/CASP

[3] RNA-puzzles - https://www.rnapuzzles.org

[4] Critical Assessment of Function Annotation (CAFA) - https://en.wikipedia.org/wiki/Critical_Assessment_of_Function_Annotation

[5] Critical Assessment of Prediction of Interactions (CAPRI) - https://en.wikipedia.org/wiki/Critical_Assessment_of_Prediction_of_Interactions

[6] Assemblathon - https://assemblathon.org

[7] VIROMOCK challenge - https://gitlab.com/ilvo/VIROMOCKchallenge

Conflict of interest:
The recommender in charge of the evaluation of the article and the reviewers declared that they have no conflict of interest (as defined in the code of conduct of PCI) with the authors or with the content of the article. The authors declared that they comply with the PCI rule of having no financial conflicts of interest in relation to the content of the article.

Funding:
No indication

Reviews

Evaluation round #2

DOI or URL of the preprint: https://zenodo.org/record/4273792

Version of the preprint: 2

Author's Reply, 02 Apr 2021

The two references have been added

https://doi.org/10.24072/pci.genomics.100014.ar2

Decision by Hadi Quesneville, posted 15 Mar 2021

The authors' responses to the reviewers are satisfactory. However, two references in the text, "Text S1" (line 117) and "Table S1" (line 125), cannot be found in the manuscript. The authors should fix this in order to have their preprint recommended.

https://doi.org/10.24072/pci.genomics.100014.d2

Evaluation round #1

DOI or URL of the preprint: https://zenodo.org/record/4293594

Version of the preprint: 1

Author's Reply, 05 Mar 2021

https://doi.org/10.24072/pci.genomics.100014.ar1

Decision by Hadi Quesneville, posted 19 Jan 2021

Dear authors, The two referees found your article interesting and potentially of great value. However, it can still be improved according to their suggestions. I recommend that you take their suggestions into account and resubmit it for a second evaluation round. Best regards, Hadi Quesneville

https://doi.org/10.24072/pci.genomics.100014.d1

Reviewed by anonymous reviewer 1, 18 Dec 2020

In this manuscript, the authors describe several semi-artificial and artificial plant virus datasets that could be used to benchmark bioinformatic pipelines for virus identification, allowing the assessment of their performance.

The initiative is very commendable and truly necessary given the number of bioinformatics tools developed today in all fields of biology. However, I have a real problem with this manuscript, which seems to me insufficiently developed and lacking in information and precision.

The subject of the article is very specialized, as it concerns the detection of plant viruses. This is why it is important to introduce the subject better.

There is little explanation of the type of data that allows this detection or of how the data are obtained (from which biological material). Are they RNA-seq or DNA-seq data, or both? Do they come from purified tissue extracts (that is, are there filtration steps to enrich for virus sequences, or are host sequences also present)?

Likewise, it would be desirable to recall the existing bioinformatic tools, or at least the approaches used for each type of question, to give an idea of the difficulties of these approaches.

The proposed datasets are not described in much detail, nor is the way they were constructed, especially for the real data. In places, figures would be useful to illustrate the text.

Another missing point is a proof of principle showing how at least some of these datasets are used and how they really enable a good benchmarking process.

Finally, the authors argue that semi-artificial datasets make it possible to bypass the drawbacks of having either only real datasets or completely artificial datasets. This seems contradictory with the fact that the authors propose 3 real datasets and 9 artificial ones among the 18 datasets. Moreover, I think the semi-artificial datasets may also have some drawbacks that could be discussed. It is possible that the drawbacks of artificial and real datasets add up.

In sum, I think this work is needed, since benchmarking bioinformatic tools is of utmost importance. However, at this stage the manuscript does not meet the standards of scientific publication.

https://doi.org/10.24072/pci.genomics.100014.rev11

Reviewed by Alexander Suh, 19 Jan 2021

Tamisier et al. provide a combination of real and semi-artificial datasets with high relevance for benchmarking detection and analysis approaches in plant virus detection. The manuscript is succinct and well written, accompanied by a detailed GitLab repository, and proposes the VIROMOCK challenge as a community-driven effort to benchmark virus detection and analysis.

Below are some minor suggestions for improved clarity that the authors may want to implement to help a broad readership.

Line 86: It is unclear whether the read lengths vary within or between each data set. Table 1 suggests that the latter is mostly the case, however, then it would help the reader if the distinct sets of read lengths were stated here in the text.
Lines 91-94: Both for the real and artificial dataset, I recommend briefly discussing the potential issues arising from Illumina's recent shift from a four-channel system (e.g., HiSeq X) to a two-channel system (e.g. NovaSeq). A recent opinion piece by De-Kayne et al. (https://onlinelibrary.wiley.com/doi/epdf/10.1111/1755-0998.13309) reviewed evidence for T>G errors in NovaSeq data and provided suggestions for how to deal with this. I assume this does not affect the datasets presented in the present manuscript (assuming all data here are based on HiSeq data or simulated on these), but this may be important to be pointed out for readers using NovaSeq data or HiSeq/NovaSeq combinations after benchmarking with the present datasets. Please also clarify in the text what system the present datasets are based or simulated on.
Line 101: Here and throughout, it may be unclear to some readers whether "non-complete genome" refers to the virus or the host.
Line 113: I commend the authors on preparing a very detailed GitLab repository. The Dryad download links appear to be working here, unlike the DOIs stated in Table 1. Please make sure that the DOIs stated in Table 1 are accessible; I was unable to look at the datasets through the Table 1 DOI links.
Line 145: Did the authors double-check that the random removal of reads led to complete absence of coverage for some genomic regions of these viruses, rather than reduced coverage for these regions?
Line 216: I like the diversity of challenging datasets discussed in the text and the authors' idea for the VIROMOCK challenge, however, for visual learners it might help to summarize key points in a figure. If the authors agree that this would help, consider providing simplified illustrations of virus detection/analysis challenges (with pointers to datasets 1-18), and/or the suggested community-driven approach of the VIROMOCK challenge.
Table 1: In the modification column, consider stating the number of reads (or read pairs) added, and possibly also the number of strains.
Table 1: In the "Challenge" column, it is not always clear which virus a specific "mutation" or "strain" refers to. Please revise for clarity by adding as much information as space allows.
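The question raised for line 145, whether random read removal produced regions with no coverage at all rather than merely reduced coverage, can be checked from read alignment intervals. The sketch below is one possible way to do so and is not taken from the manuscript: it assumes alignments are available as 0-based, end-exclusive `(start, end)` pairs on a genome of known length, and the function names are hypothetical.

```python
def depth_per_base(genome_len, alignments):
    """Per-base read depth from (start, end) intervals, end exclusive."""
    diff = [0] * (genome_len + 1)
    for start, end in alignments:
        diff[start] += 1   # a read begins covering at `start`
        diff[end] -= 1     # and stops covering at `end`
    depth, running = [], 0
    for i in range(genome_len):
        running += diff[i]
        depth.append(running)
    return depth

def zero_coverage_regions(depth):
    """Return (start, end) intervals, end exclusive, where depth is zero."""
    regions, start = [], None
    for i, d in enumerate(depth):
        if d == 0 and start is None:
            start = i
        elif d != 0 and start is not None:
            regions.append((start, i))
            start = None
    if start is not None:
        regions.append((start, len(depth)))
    return regions

# Toy example: three aligned reads on a 20-base genome.
depth = depth_per_base(20, [(0, 5), (3, 8), (12, 20)])
gaps = zero_coverage_regions(depth)
```

An empty `gaps` list would indicate that the removal only lowered coverage, whereas a non-empty list gives the exact positions left without any read.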

https://doi.org/10.24072/pci.genomics.100014.rev12
