Recommendation

EukProt enables reproducible Eukaryota-wide protein sequence analyses

Gavin Douglas based on reviews by 2 anonymous reviewers

A recommendation of:

EukProt: A database of genome-scale predicted proteins across the diversity of eukaryotes

Daniel J. Richter, Cédric Berney, Jürgen F. H. Strassert, Yu-Ping Poh, Emily K. Herman, Sergio A. Muñoz-Gómez, Jeremy G. Wideman, Fabien Burki, Colomban de Vargas (2022), bioRxiv, 2020.06.30.180687, ver. 5 peer-reviewed and recommended by Peer Community in Genomics https://doi.org/10.1101/2020.06.30.180687

Now published in Peer Community Journal

Abstract

EukProt: A database of genome-scale predicted proteins across the diversity of eukaryotes

EukProt is a database of published and publicly available predicted protein sets selected to represent the breadth of eukaryotic diversity, currently including 993 species from all major supergroups as well as orphan taxa. The goal of the database is to provide a single, convenient resource for gene-based research across the spectrum of eukaryotic life, such as phylogenomics and gene family evolution. Each species is placed within the UniEuk taxonomic framework in order to facilitate downstream analyses, and each data set is associated with a unique, persistent identifier to facilitate comparison and replication among analyses. The database is regularly updated, and all versions will be permanently stored and made available via FigShare. The current version has a number of updates, notably ‘The Comparative Set’ (TCS), a reduced taxonomic set with high estimated completeness while maintaining a substantial phylogenetic breadth, which comprises 196 predicted proteomes. A BLAST web server and graphical displays of data set completeness are available at http://evocellbio.com/eukprot/ . We invite the community to provide suggestions for new data sets and new annotation features to be included in subsequent versions, with the goal of building a collaborative resource that will promote research to understand eukaryotic diversity and diversification.

eukaryotes, eukaryotic diversity, transcriptomes, genomes, single-cell genomes, phylogenomics, UniEuk taxonomy, predicted proteins

Submission: posted 08 June 2022
Recommendation: posted 10 September 2022, validated 15 September 2022

Cite this recommendation as:
Douglas, G. (2022) EukProt enables reproducible Eukaryota-wide protein sequence analyses. Peer Community in Genomics, 100021. https://doi.org/10.24072/pci.genomics.100021

Recommendation

Comparative genomics is a general approach for understanding how genomes differ, which can be considered from many angles. For instance, this approach can delineate how gene content varies across organisms, which can lead to novel hypotheses regarding what those organisms do. It also enables investigations into the sequence-level divergence of orthologous DNA, which can provide insight into how evolutionary forces differentially shape genome content and structure across lineages.

Such comparisons are often restricted to protein-coding genes, as these are sensible units for assessing putative function and for identifying homologous matches in divergent genomes. Although information is lost by focusing only on the protein-coding portion of genomes, this simplifies analyses and has led to crucial findings in recent years. Perhaps most dramatically, analyses based on hundreds of orthologous proteins across microbial eukaryotes are fundamentally changing our understanding of the eukaryotic tree of life (Burki et al. 2020).

These and other topics are highlighted in a new pre-print from Dr. Daniel Richter and colleagues, which describes EukProt (Richter et al. 2022): a database containing protein sets from 993 eukaryotic species. The authors provide a BLAST portal for matching custom sequences against this database (https://evocellbio.com/eukprot/) and the entire database is available for download (https://doi.org/10.6084/m9.figshare.12417881.v3). They also provide a subset of their overall dataset, ‘The Comparative Set’, which contains only high-quality proteomes and is meant to maximize phylogenetic diversity.

There are two major advantages of EukProt:

1. It will enable researchers to quickly compare proteomes and perform phylogenomic analyses, without needing the skills or the time commitment to aggregate and process these data. The authors make it clear that acquiring the raw protein sets was non-trivial, as they were distributed across a wide variety of online repositories (some of which are no longer accessible!).

2. Analyses based on this database will be more reproducible and more easily compared across studies than those based on custom-made databases for individual studies. This is because the EukProt authors built their database following the FAIR principles (Wilkinson et al. 2016), a set of guidelines for enhancing data reusability. So, for instance, each proteome has a unique identifier in EukProt, and all species are annotated in a unified taxonomic framework, which will aid in standardizing comparisons across studies.
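As a rough illustration of why stable data set identifiers make studies easier to compare, the sketch below tallies proteins per data set from FASTA headers and lists the data sets shared by two analyses. It is only a minimal sketch under stated assumptions: the header layout (identifier, then an underscore, then a protein label), the identifiers `DS0001` to `DS0003`, and both function names are hypothetical and are not taken from EukProt itself.

```python
from collections import Counter
import io


def count_proteins_per_dataset(fasta_handle, sep="_"):
    """Count sequences per data set, keyed by the identifier prefix of each header.

    Assumes (hypothetically) that every header looks like
    '><dataset id><sep><protein label>'.
    """
    counts = Counter()
    for line in fasta_handle:
        if line.startswith(">"):
            dataset_id = line[1:].strip().split(sep, 1)[0]
            counts[dataset_id] += 1
    return counts


def shared_datasets(counts_a, counts_b):
    """Return the sorted identifiers of data sets present in both analyses."""
    return sorted(set(counts_a) & set(counts_b))


# Toy inputs with made-up identifiers and sequences.
study_a = io.StringIO(">DS0001_p1\nMKT\n>DS0001_p2\nMLV\n>DS0002_p1\nMAA\n")
study_b = io.StringIO(">DS0002_p7\nMGG\n>DS0003_p1\nMCC\n")

counts_a = count_proteins_per_dataset(study_a)
counts_b = count_proteins_per_dataset(study_b)

print(dict(counts_a))
print(shared_datasets(counts_a, counts_b))
```

With identifiers that persist across database versions, a comparison like this reduces to a set intersection, which is not possible when each study assembles and names its own protein sets.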

The authors make it clear that there is still work to be done. For example, there is an uneven representation of proteomes across different eukaryotic lineages, which can only be addressed by further characterization of poorly studied lineages. In addition, the authors note that it would ultimately be best for the EukProt database to be integrated into an existing large-scale repository, like NCBI, which would help ensure that important eukaryotic diversity was not ignored. Nonetheless, EukProt represents an excellent example of how reproducible bioinformatics resources should be designed and should prove to be an extremely useful resource for the field.

References

Burki F, Roger AJ, Brown MW, Simpson AGB (2020) The New Tree of Eukaryotes. Trends in Ecology & Evolution, 35, 43–55. https://doi.org/10.1016/j.tree.2019.08.008

Richter DJ, Berney C, Strassert JFH, Poh Y-P, Herman EK, Muñoz-Gómez SA, Wideman JG, Burki F, Vargas C de (2022) EukProt: A database of genome-scale predicted proteins across the diversity of eukaryotes. bioRxiv, 2020.06.30.180687, ver. 5 peer-reviewed and recommended by Peer Community in Genomics. https://doi.org/10.1101/2020.06.30.180687

Wilkinson MD, Dumontier M, Aalbersberg IjJ, Appleton G, Axton M, Baak A, Blomberg N, Boiten J-W, da Silva Santos LB, Bourne PE, Bouwman J, Brookes AJ, Clark T, Crosas M, Dillo I, Dumon O, Edmunds S, Evelo CT, Finkers R, Gonzalez-Beltran A, Gray AJG, Groth P, Goble C, Grethe JS, Heringa J, ’t Hoen PAC, Hooft R, Kuhn T, Kok R, Kok J, Lusher SJ, Martone ME, Mons A, Packer AL, Persson B, Rocca-Serra P, Roos M, van Schaik R, Sansone S-A, Schultes E, Sengstag T, Slater T, Strawn G, Swertz MA, Thompson M, van der Lei J, van Mulligen E, Velterop J, Waagmeester A, Wittenburg P, Wolstencroft K, Zhao J, Mons B (2016) The FAIR Guiding Principles for scientific data management and stewardship. Scientific Data, 3, 160018. https://doi.org/10.1038/sdata.2016.18

Conflict of interest:
The recommender in charge of the evaluation of the article and the reviewers declared that they have no conflict of interest (as defined in the code of conduct of PCI) with the authors or with the content of the article. The authors declared that they comply with the PCI rule of having no financial conflicts of interest in relation to the content of the article.

Reviews

Evaluation round #2

DOI or URL of the preprint: https://doi.org/10.1101/2020.06.30.180687

Version of the preprint: 3

Author's Reply, 09 Sep 2022

https://doi.org/10.24072/pci.genomics.100167.ar2

Decision by Gavin Douglas, posted 30 Aug 2022

To Dr. Richter and colleagues,

I think your changes have addressed most of the reviewers’ comments (except for one minor comment – see below). The manuscript is in excellent condition and requires only a small tweak prior to recommendation.

One important thing to note is that I received a “timed out” error when trying to load http://evocellbio.com/eukprot/ - I’m guessing this was just a transient problem, but should be checked.

The minor comment that I think the authors perhaps missed was this partial statement from reviewer 1:
“…mention the fact that they cannot technically evaluate the tools and parameters selection for the de novo transcriptome assembly paragraphs (lines 300-306) and the automated genome annotation (lines 329-338)”

Those line numbers no longer match, but the sections correspond to the paragraphs starting with “‘assemble mRNA’: de novo transcriptome assembly” and “‘predict genes’: we used EukMetaSanity”, respectively. I think a little more explanation of why these parameters were chosen is needed (e.g., stating why using the same parameters as Alexander et al. 2021 makes sense, in the case of ‘predict genes’). If the options are somewhat arbitrary, which might be the case with the assembly and filtering options, then the authors could mention that these options were not evaluated but are similar to those commonly used, which I believe would address the reviewer’s point.

Last, I recommend two very minor changes:

In your title, I recommend that you change “a database” after “EukProt:” to “A database”. I believe that most style guides suggest the latter format, but the former is widespread in the scientific literature so I leave that choice to you.

I do however strongly think that the link to the webserver should be added to the abstract, which I think many readers have come to expect when reading about a bioinformatics resource.

Once these last changes are addressed I would be pleased to recommend your article.

All the best,

Gavin Douglas

https://doi.org/10.24072/pci.genomics.100167.d2

Evaluation round #1

DOI or URL of the preprint: https://doi.org/10.1101/2020.06.30.180687

Version of the preprint: 1

Author's Reply, 29 Aug 2022

https://doi.org/10.24072/pci.genomics.100167.ar1

Decision by Gavin Douglas, posted 14 Jul 2022

Two reviewers have completed their assessments and have determined that the manuscript is sound and describes a useful resource for the field. They have requested only minor revisions.

I share their enthusiasm for this resource and look forward to seeing the next draft of this manuscript.

I concur with reviewer #2 that more detailed discussion regarding how their resource differs from PhyloDB is needed. I also did not find Table 1 very informative, and I would think a supplementary table listing the actual URLs used would be more relevant to interested readers. But since the reviewers did not take issue with the table, I will not require it to be changed, and leave the decision to the authors’ preference.

Given the small changes requested, the authors should aim to format their manuscript according to PCI Genomics guidelines (see: https://genomics.peercommunityin.org/help/guide_for_authors#h_3273113785671619705234847)

Formatting issues that I noticed are:

- Table and Figure should be embedded within the text.

- An email for the corresponding author should be indicated

- Rather than a database availability statement, this should be moved to the end of the abstract (for the link to the database webserver), and also mentioned in the “Data, script and code availability” section at the end of the manuscript.

- I believe moving all of the descriptions of the database to a “Results and Discussion” heading would be most appropriate for this article type (and the current headings, such as “The EukProt Database” changed to sub-headings). Based on the formatting guidelines, PCI Genomics strongly recommends separate Results and Discussion headers, but I think a combined section would be acceptable in this case, as the manuscript is very clear as it is.

- The methods section should be moved before the Results section

- Please re-format the acknowledgements section to match the recommended format. Also, is the lower-case “i” in Núria Ros i Rocher a typo? I think this is supposed to be a hyphen.

- Please add a Data, script, and code availability section at the end. Note that the authors’ custom code must also be made available in this section.

- Put in-text citations in square brackets when they are within parenthetic phrases (e.g., “Eisen, 2003” at L48).

https://doi.org/10.24072/pci.genomics.100167.d1

Reviewed by anonymous reviewer 1, 11 Jul 2022

The present manuscript presents the protein-based database EukProt, which has been built on reference data from genomes, single cells and transcriptomes. This update aligns with the FAIR principles and introduces a new high-quality reference dataset that was explicitly set up for comparative genomics and that tries to meet a high taxonomic standard that aligns with UniEuk.

The manuscript is well justified and clear in its description and outline. The only critique that I have is that it fails to point out the limitations of EukProt in a specific manner. For future users, however, the limitations are as important as the strengths, in particular when the database is used as a reference for the whole scientific community.
Therefore, I'd like to recommend adding a small paragraph that points out the limitations of the database. This could, for instance, highlight cases in which the database will be of only limited use (e.g. a list of lineages that are not well covered and that still require joint sequencing efforts could be given here, to balance the statement about the lineages that have more than 100 taxa; similarly, this could be pointed out for the comparative genomics set), or limitations of the current gene prediction models, taxonomic paths, ... (not all may be necessary, but I'd at least mention/discuss the most important ones for the users).

Thank you

https://doi.org/10.24072/pci.genomics.100167.rev11

Reviewed by anonymous reviewer 2, 13 Jul 2022

This article presents the release of EukProt: a database of eukaryotic genome-scale predicted proteins. The manuscript nicely outlines the pitfalls in shared genomics data accessibility and presents EukProt as a solution for several challenges of comparative genomics analyses, which will become even stronger with the exponential increase in genomic data production. It then continues by describing the database utilities, downloadables, generic structure, abidance to FAIR principles, and community-provided update possibilities and finishes with a detailed description of the methodology.

The title and abstract are clear and straight to the point. Overall the article stands out for its clarity, detailed methodology, input database specifications, comprehensiveness, and the range of bioinformatics challenges that the authors address with the development of this resource. The number of repositories considered in constructing the database is impressive, and so is the subsequent integration of custom-processed raw data (assemblies, annotations). The authors have a clear and deep knowledge of the comparative genomics issues that the scientific community is facing and provide an elegant solution through a genomic analysis framework enriched with some of the most solid and state-of-the-art comparative genomics tools (examples: the UniEuk taxonomic framework, BUSCO completeness scores). It indicates particular sensitivity and integrity of the authors toward a modern (e.g. foreseeing the ocean metagenomics data integration) and virtuous (e.g. providing various downloadables such as genome annotations) way of approaching bioinformatics resource development. This sensitivity is best exemplified by the presentation of The Comparative Set (TCS), a selection of taxonomically fairly distributed, highly complete predicted-protein sets, which will hopefully serve as a basis for many comparative genomics analyses in future eukaryotic biology studies.

This reviewer will only provide a few minor comments about the clarity of some sentences, as well as mention the fact that they cannot technically evaluate the tools and parameters selection for the de novo transcriptome assembly paragraphs (lines 300-306) and the automated genome annotation (lines 329-338). This reviewer particularly praises the care given to the methods producing The Comparative Set.

This reviewer would be happy to see this resource further expand and recommends it for PCI Genomics validation.

Minor comments:

Lines 68-70: The authors could better explain how EukProt differentiates from PhyloDB.

Lines 73-75: This reviewer could not find protein data files comprising protein domains, Interpro, or gene ontologies from the downloadables (genome annotations, protein fasta files). Not clear if they are provided or if they are mentioned as an example of data with difficult accessibility. Either way, it could be better explained. EDIT: found the mention of potential addition in the future at lines 205-208, this reviewer would still advise rephrasing lines 73-75 for immediate clarity.

https://doi.org/10.24072/pci.genomics.100167.rev12
