PCI Genomics

Title *

MATEdb, a data repository of high-quality metazoan transcriptome assemblies to accelerate phylogenomic studies

Authors *

Rosa Fernandez, Vanina Tonzo, Carolina Simon Guerrero, Jesus Lozano-Fernandez, Gemma I Martinez-Redondo, Pau Balart-Garcia, Leandro Aristide, Klara Eleftheriadi, Carlos Vargas-Chavez

Year *

2022

Abstract *

<p style="text-align: justify;">With the advent of high-throughput sequencing, the amount of genomic data available for animal (Metazoa) species has bloomed over the last decade, especially from transcriptomes, owing to lower sequencing costs and an easier assembly process than for genomes. Transcriptomic data sets have proven useful for phylogenomic studies, such as inference of phylogenetic interrelationships (e.g., species tree reconstruction) and comparative genomics analyses (e.g., gene repertoire evolutionary dynamics). However, these data sets are often analyzed with different analytical pipelines, frequently including different software versions, leading to potential methodological biases when they are analyzed jointly in a comparative framework. Moreover, these analyses are computationally expensive and not affordable for a large part of the scientific community. More importantly, assembled transcriptomes are usually not deposited in public databases. Furthermore, the quality of these data sets is hardly ever taken into consideration, potentially impacting subsequent analyses such as orthology inference and phylogenetic or gene repertoire evolution inference. To alleviate these issues, we present Metazoan Assemblies from Transcriptomic Ensembles (MATEdb), a curated database of 335 high-quality transcriptome assemblies from different animal phyla analyzed following the same pipeline. The repository is composed, for each species, of (1) a de novo transcriptome assembly, (2) its candidate coding regions within transcripts (at the level of both nucleotide and amino acid sequences), (3) the coding regions filtered using their contamination profile (i.e., only metazoan content), (4) the longest isoform of the amino acid candidate coding regions, (5) the gene content completeness score as assessed against the BUSCO database, and (6) an orthology-based gene annotation. 
We complement the repository with gene annotations from high-quality genomes, which are often not straightforward to obtain from individual sequencing projects, bringing the total to 423 high-quality genomic and transcriptomic data sets. We invite the community to suggest new data sets and new annotation features to be included in subsequent versions, which will be analyzed following the same pipeline and permanently stored in public repositories. We believe that MATEdb will accelerate research on animal phylogenomics while saving thousands of hours of computational work in a plea for open and collaborative science. </p>

Indicate the full web address (DOI or URL) giving public access to these data (if you have any problems with the deposit of your data, please contact contact@genomics.peercommunityin.org). In case all raw data are included in the preprint, indicate the DOI or URL of the preprint. *

Indicate the full web address (DOI or URL) giving public access to these scripts (if you have any problems with the deposit of your scripts, please contact contact@genomics.peercommunityin.org). In case all raw scripts are included in the preprint, indicate the DOI or URL of the preprint. *

https://github.com/MetazoaPhylogenomicsLab/MATEdb

Indicate the full web address (DOI, SWHID or URL) giving public access to these codes (if you have any problems with the deposit of your codes, please contact contact@genomics.peercommunityin.org). In case all raw codes are included in the preprint, indicate the DOI or URL of the preprint. *

https://github.com/MetazoaPhylogenomicsLab/MATEdb

Keywords (optional)

Transcriptome; Assembly; Invertebrates; non-model organisms; CDS; Annotation

Methods that require specific expertise (optional)

None

Thematic fields *

Bioinformatics, Evolutionary genomics, Functional genomics

Submission date

2022-07-20 07:30:39

Recommender

Samuel Abalde
