PCI Genomics

244

Title

Efficient k-mer based curation of raw sequence data: application in *Drosophila suzukii*

Authors

Gautier Mathieu

Year

2023

Abstract

<p>Several studies have highlighted the presence of contaminated entries in public sequence repositories, calling for special attention to the associated metadata. Here, we propose and evaluate a fast and efficient k-mer-based approach to assess the degree of mislabeling or contamination. We applied it to high-throughput whole-genome raw sequence data for 236 Ind-Seq and 22 Pool-Seq samples of the invasive species Drosophila suzukii. We first used the CLARK software to build a dictionary of species-discriminating k-mers from the curated assemblies of 29 target drosophilid species (including D. melanogaster, D. simulans, D. subpulchrella and D. biarmipes) and 12 common Drosophila pathogens and commensals (including Wolbachia). Counting the number of k-mers composing each query sample sequence that matched a discriminating k-mer from the dictionary provided a simple criterion for assigning each sequence to a target species and for evaluating the sample as a whole. Analyses of a wide range of samples, representative of both target and other drosophilid species, demonstrated very good performance of the proposed approach, both in terms of run time and accuracy of sequence assignment. Of the 236 D. suzukii individuals, five were reassigned to D. simulans and eleven to D. subpulchrella. Another four showed moderate to substantial microbial contamination. Similarly, among the 22 Pool-Seq samples analyzed, two from the native range were found to be contaminated with 1 and 7 D. subpulchrella individuals, respectively (out of 50), and one from Europe was found to be contaminated with 5 to 6 D. immigrans individuals (out of 100). Overall, the present analysis allowed the definition of a large curated dataset consisting of >60 population samples representative of the worldwide genetic diversity, which may be valuable for further population genetics studies on D. suzukii.
More generally, while we advocate careful sample identification and verification prior to sequencing, the proposed framework is simple and computationally efficient enough to be included as a routine post-hoc quality check prior to any data analysis and prior to data submission to public repositories.</p>
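The counting criterion described in the abstract can be illustrated with a minimal Python sketch. This is not the CLARK implementation or the author's pipeline: the function names (`build_dictionary`, `assign_read`, `summarize_sample`), the toy reference sequences, the value `k = 4`, and the tie-handling rule are all assumptions made for illustration, and reverse complements are ignored for brevity.

```python
from collections import Counter


def kmers(seq, k):
    """Yield every overlapping k-mer of seq, in upper case."""
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def build_dictionary(references, k):
    """Map each k-mer seen in exactly one reference to that reference's label.

    `references` is a hypothetical {label: sequence} mapping standing in for
    the curated target assemblies.
    """
    owner = {}
    shared = set()
    for label, seq in references.items():
        for km in set(kmers(seq, k)):
            if km in shared:
                continue
            if km in owner and owner[km] != label:
                # Seen in two references: not discriminating, drop it.
                del owner[km]
                shared.add(km)
            else:
                owner[km] = label
    return owner


def assign_read(read, dictionary, k):
    """Assign a read to the label with the most discriminating k-mer hits.

    Returns "unassigned" when there is no hit or when the top two labels tie
    (an illustrative rule, not necessarily the one used in the study).
    """
    hits = Counter(dictionary[km] for km in kmers(read, k) if km in dictionary)
    ranked = hits.most_common(2)
    if not ranked:
        return "unassigned"
    if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
        return "unassigned"
    return ranked[0][0]


def summarize_sample(reads, dictionary, k):
    """Return the fraction of reads assigned to each label for one sample."""
    counts = Counter(assign_read(r, dictionary, k) for r in reads)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()} if total else {}


# Toy data: two made-up "assemblies" and four made-up reads.
references = {"species_A": "ACGTACGTTAGC", "species_B": "TTGACCGTAGGA"}
k = 4
dictionary = build_dictionary(references, k)
reads = ["ACGTACG", "ACGTACG", "GACCGTAG", "GGGGGGG"]
proportions = summarize_sample(reads, dictionary, k)
print(proportions)
```

In this toy case the per-label fractions play the role of the sample-level summary: a sample labeled as one species but dominated by reads assigned to another would be flagged as mislabeled, and a sizeable minority fraction would suggest contamination.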

Full web address (DOI or URL) giving public access to these data

https://doi.org/10.57745/HYTIBH

Full web address (DOI or URL) giving public access to these scripts

https://doi.org/10.57745/HYTIBH

Full web address (DOI, SWHID or URL) giving public access to these codes

https://doi.org/10.57745/HYTIBH

Keywords

data curation, kmer, Drosophila suzukii, Pool-Seq, Ind-Seq

Methods that require specific expertise (optional)

None

Thematic fields

Bioinformatics, Population genomics

Suggested reviewers

Alan Bergland [alan.bergland@virginia.edu], Joanna Chiu [jcchiu@ucdavis.edu], Marta Coronado-Zamora [marta.coronado@csic.es], Stefano Lonardi [stelo@cs.ucr.edu]

Submission date

2023-04-20 22:05:13

Recommender

Nicolas Galtier

Reviewers
