Recommendation

Decontaminating reads, not contigs

Nicolas Galtier based on reviews by Marie Cariou and Denis Baurain

A recommendation of:

Efficient k-mer based curation of raw sequence data: application in Drosophila suzukii

Gautier Mathieu (2023), bioRxiv, ver.2, peer-reviewed and recommended by PCI Genomics https://doi.org/10.1101/2023.04.18.537389

Now published in Peer Community Journal.

Abstract

Efficient k-mer based curation of raw sequence data: application in Drosophila suzukii

Several studies have highlighted the presence of contaminated entries in public sequence repositories, calling for special attention to the associated metadata. Here, we propose and evaluate a fast and efficient k-mer-based approach to assess the degree of mislabeling or contamination. We applied it to high-throughput whole-genome raw sequence data for 236 Ind-Seq and 22 Pool-Seq samples of the invasive species Drosophila suzukii. We first used CLARK software to build a dictionary of species-discriminating k-mers from the curated assemblies of 29 target drosophilid species (including D. melanogaster, D. simulans, D. subpulchrella and D. biarmipes) and 12 common Drosophila pathogens and commensals (including Wolbachia). Counting the number of k-mers composing each query sample sequence that matched a discriminating k-mer from the dictionary provided a simple criterion for assignment to target species and evaluation of the entire sample. Analyses of a wide range of samples, representative of both target and other drosophilid species, demonstrated very good performance of the proposed approach, both in terms of run time and accuracy of sequence assignment. Of the 236 D. suzukii individuals, five were reassigned to D. simulans and eleven to D. subpulchrella. Another four showed moderate to substantial microbial contamination. Similarly, among the 22 Pool-Seq samples analyzed, two from the native range were found to be contaminated with 1 and 7 D. subpulchrella individuals, respectively (out of 50), and one from Europe was found to be contaminated with 5 to 6 D. immigrans individuals (out of 100). Overall, the present analysis allowed the definition of a large curated dataset consisting of >60 population samples representative of the worldwide genetic diversity, which may be valuable for further population genetics studies on D. suzukii. More generally, while we advocate careful sample identification and verification prior to sequencing, the proposed framework is simple and computationally efficient enough to be included as a routine post-hoc quality check prior to any data analysis and prior to data submission to public repositories.

data curation, kmer, Drosophila suzukii, Pool-Seq, Ind-Seq

Submission: posted 20 April 2023, validated 27 April 2023
Recommendation: posted 03 August 2023, validated 09 August 2023

Cite this recommendation as:
Galtier, N. (2023) Decontaminating reads, not contigs. Peer Community in Genomics, 100244. https://doi.org/10.24072/pci.genomics.100244

Recommendation

Contamination, the presence of foreign DNA sequences in a sample of interest, is currently a major problem in genomics. Because contamination is often unavoidable at the experimental stage, it is increasingly recognized that the processing of high-throughput sequencing data must include a decontamination step. This is usually performed after the many sequence reads have been assembled into a relatively small number of contigs. Dubious contigs are then discarded based on their composition (e.g. GC-content) or because they are highly similar to a known piece of DNA from a foreign species.

Here [1], Mathieu Gautier explores a novel strategy consisting of decontaminating reads, not contigs. Why is this promising? Assembly programs and algorithms are complex, and it is not easy to predict, or monitor, how they handle contaminant reads. Ideally, contaminant reads will be assembled into obvious contaminant contigs. However, there might be more complex situations, such as chimeric contigs with alternating genuine and contaminant segments. Decontaminating at the read level, if possible, should eliminate such unfavorable situations, in which sequence information from contaminant and target samples is intimately intertwined by an assembler.

To achieve this aim, Gautier proposes to use methods initially designed for the analysis of metagenomic data. This is pertinent since the decontamination process involves considering a sample as a mixture of different sources of DNA. The programs used here, CLARK and CLARK-L, are based on so-called k-mer analysis, meaning that the similarity between a read to be annotated and a reference sequence is measured by how many sub-sequences (of length 31 base pairs for CLARK and 27 base pairs for CLARK-L) they share. This is known to be much more efficient than traditional sequence alignment algorithms when it comes to comparing a very large number of (most often unrelated) sequences. This is, therefore, a reference-based approach, in which the reads from a sample are assigned to previously sequenced genomes based on k-mer content.
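The k-mer principle described above can be illustrated with a minimal Python sketch. It is not CLARK's implementation: the reference sequences, the labels `species_A` and `species_B`, the function names, and the toy value k = 5 (chosen for readability instead of 31 or 27) are all assumptions made for this example. A real classifier would also have to handle reverse complements, sequencing errors, and memory-efficient storage.

```python
from collections import Counter, defaultdict


def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def build_discriminative_dictionary(references, k):
    """Keep only k-mers found in exactly one reference, mapped to its label."""
    owners = defaultdict(set)
    for label, seq in references.items():
        for km in kmers(seq, k):
            owners[km].add(label)
    return {km: next(iter(labels))
            for km, labels in owners.items() if len(labels) == 1}


def assign_read(read, dictionary, k):
    """Count discriminative k-mer hits per label and return (best label, hits).

    The label is None when the read has no hit or when the top two labels tie.
    """
    hits = Counter()
    for i in range(len(read) - k + 1):
        label = dictionary.get(read[i:i + k])
        if label is not None:
            hits[label] += 1
    ranked = hits.most_common(2)
    if not ranked:
        return None, hits
    if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
        return None, hits
    return ranked[0][0], hits


# Hypothetical toy references; the two sequences share the stretch GCTTAGGC.
references = {
    "species_A": "ACGTACGGTCAAGCTTAGGC",
    "species_B": "TTGACCGTAAGGCTTAGGCA",
}
K = 5
dictionary = build_discriminative_dictionary(references, K)

label, hits = assign_read("ACGGTCAAGC", dictionary, K)
print(label, dict(hits))
print(assign_read("GCTTAGGC", dictionary, K)[0])
```

The second read lies entirely within the region shared by both toy references, so none of its k-mers is discriminative and it stays unassigned. This is the behavior one would expect for sequence conserved between closely related species.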

This original approach is here specifically applied to the case of Drosophila suzukii, an invasive pest damaging fruit production in Europe and America. Fortunately, Drosophila is a genus of insects with abundant genomic resources, including high-quality reference genomes in dozens of species. Having calibrated and validated his pipeline using data sets of known origins, Gautier quantifies in each of 258 presumed D. suzukii samples the proportion of reads that likely belong to other species of fruit flies, or to fruit fly-associated microbes. This proportion is close to one in 16 samples, which clearly correspond to mislabeled individuals. It is non-negligible in another ~10 samples, which do correspond to genuine D. suzukii individuals. Most of these reads of unexpected origin are contaminants and should be filtered out. Interestingly, one D. suzukii sample contains a substantial proportion of reads from the closely related D. subpulchrella, which might instead reflect a recent episode of gene flow between these two species. The approach, therefore, not only serves as a crucial technical step, but also has the potential to reveal biological processes.
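The sample-level reasoning in this paragraph (a foreign proportion close to one suggests mislabeling, a smaller but non-negligible one suggests contamination) can be sketched as follows. The function name, the status labels, and the 5% threshold are hypothetical choices for illustration only; they are not taken from the preprint, whose criteria should be consulted directly.

```python
from collections import Counter


def summarize_sample(assignments, expected, contamination_threshold=0.05):
    """Summarize per-read labels for one sample.

    assignments: iterable of labels (None for unassigned reads).
    expected: the species the sample is labeled as.
    contamination_threshold: illustrative cutoff, not a value from the paper.
    """
    counts = Counter(a for a in assignments if a is not None)
    total = sum(counts.values())
    if total == 0:
        return {"status": "unassigned", "proportions": {}}
    proportions = {label: n / total for label, n in counts.items()}
    dominant = max(proportions, key=proportions.get)
    foreign = 1.0 - proportions.get(expected, 0.0)
    if dominant != expected:
        status = "mislabeled"
    elif foreign > contamination_threshold:
        status = "contaminated"
    else:
        status = "clean"
    return {
        "status": status,
        "dominant": dominant,
        "foreign_fraction": foreign,
        "proportions": proportions,
    }


# Hypothetical read assignments for three samples labeled as D. suzukii.
clean = ["D. suzukii"] * 98 + ["D. simulans"] * 2
swapped = ["D. simulans"] * 95 + ["D. suzukii"] * 5
mixed = ["D. suzukii"] * 80 + ["Wolbachia"] * 20 + [None] * 10

for name, reads in [("clean", clean), ("swapped", swapped), ("mixed", mixed)]:
    print(name, summarize_sample(reads, "D. suzukii")["status"])
```

A summary of this kind cannot distinguish contamination from gene flow: a sample with a substantial share of reads from a sister species would be flagged in the same way in both cases, so the interpretation still requires biological judgment.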

Gautier's thorough, well-documented work will clearly benefit the ongoing and future research on D. suzukii, and Drosophila genomics in general. The author and reviewers rightfully note that, like any reference-based approach, this method is heavily dependent on the availability and quality of reference genomes - Drosophila being a favorable case. Building the reference database is a key step, and the interpretation of the output can only be made in the light of its content and gaps, as illustrated by Gautier's careful and detailed discussion of his numerous results.

This pioneering study is a striking demonstration of the potential of metagenomic methods for the decontamination of high-throughput sequence data at the read level. The pipeline requires remarkably few computing resources, ensuring low carbon emission. I am looking forward to seeing it applied to a wide range of taxa and samples.

Reference

[1] Gautier Mathieu. Efficient k-mer based curation of raw sequence data: application in Drosophila suzukii. bioRxiv, 2023.04.18.537389, ver. 2, peer-reviewed and recommended by Peer Community in Genomics. https://doi.org/10.1101/2023.04.18.537389

Conflict of interest:
The recommender in charge of the evaluation of the article and the reviewers declared that they have no conflict of interest (as defined in the code of conduct of PCI) with the authors or with the content of the article. The authors declared that they comply with the PCI rule of having no financial conflicts of interest in relation to the content of the article.

Funding:
The authors declare that they have received no specific funding for this study.

Reviews

Evaluation round #1

DOI or URL of the preprint: https://doi.org/10.1101/2023.04.18.537389

Version of the preprint: 1

Author's Reply, 19 Jul 2023

https://doi.org/10.24072/pci.genomics.100244.ar1

Decision by Nicolas Galtier, posted 22 Jun 2023, validated 23 Jun 2023

This study introduces a novel approach for assessing and treating the problems of contamination and mislabeling in high-throughput genomic data. The idea is to recycle methods developed for species identification based on metagenomic data. The manuscript was reviewed by two colleagues, both of whom are very positive - and so am I. A number of relevant suggestions were made, which I think should help improve the manuscript.

I have an additional comment, also briefly mentioned by one reviewer. Besides contamination, there might be biological reasons why a given sample contains sequence reads assigned to a different species, namely hybridization and gene flow. How is the newly introduced method expected to behave when reproductive isolation between the analyzed species is incomplete? In particular, is there a risk that the method partially erases the signal of gene flow, if actually present? I think these questions deserve a specific discussion, as gene flow is quite common in nature and the focus of many population genomic studies.

I would be happy to consider a revised version of the manuscript for possible recommendation.

https://doi.org/10.24072/pci.genomics.100244.d1

Reviewed by Marie Cariou, 23 May 2023

This article describes a procedure used to check publicly available sequence data of Drosophila suzukii for mislabeling and contamination. The procedure relies on the construction of dictionaries of discriminatory k-mers to compare with the k-mers present in each dataset. It was performed using the software CLARK, which was created for the taxonomic classification of metagenomic sequences.

The procedure efficiently identified 16 mislabeled samples among the 236 individual D. suzukii sequence data and 2 contaminated samples among 22 pool-seq sequence data.

I found this approach really interesting and well presented in the manuscript.

The author 1) advocates for the routine inclusion of such a k-mer based quality check in data quality assessment practices, and 2) presents a curated dataset of D. suzukii public sequences, useful for further population genomics studies.

I have a question regarding the idea that such a check should be included in standard quality assessment. In this analysis, the author relied on extensive and curated genome assemblies (« high quality assemblies for several dozen of drosophilid genomes »). Here, these numerous genomes also make it possible to evaluate the "global" efficiency of the approach, but I wonder to what extent such an approach could be easily generalized to any species. What would be the author's guidelines for performing such a check on any genomic dataset? To put it differently, what would be the minimal external data (in terms of both assembly quality and taxonomic coverage) required to construct a meaningful dictionary?

L47-51 the repetition of « the resulting combined datasets » might be avoided.

L237. I think « 305 » should be « 301 », to match the sum listed in the paragraph (43 + 236 + 22), which is also consistent with the number of lines in Tables S2 and S3 and with the value at L331.

Sorry if I missed the correct sense of the number.

Fig 2B. Are the colors corresponding to target and other (light and dark blue) reversed? I expected the more dispersed and almost bimodal distribution (dark blue), with a higher percentage of sequences with no match, to correspond to the « other species ».

L314-316 Does the option -s 2 have a strong impact on computation time and on the fraction of sequences with no matching k-mers?

l410 « may thus [be] display »

I was able to retrieve the databases, cleaned assemblies and scripts from the Data INRAE repository, but I did not attempt to run CLARK myself. However, they look well formatted and organized.

In “run_fastp_clarkl_clark_and_summarize_results.sh”: l20: “cleanning seqeunce” → “cleaning sequences”

https://doi.org/10.24072/pci.genomics.100244.rev11

Reviewed by Denis Baurain, 14 Jun 2023

In this empirical study on Drosophila whole genome samples, Gautier evaluates the use of the metagenomic classifier CLARK to analyse the contamination structure of short-read datasets by closely related species of the advertised organism and its microbial commensals. The author shows that this approach is both accurate and computationally efficient and, as a byproduct, releases a curated set of >60 population samples of D. suzukii that should be useful in future population genetic studies.

Generally speaking, I enjoyed reviewing this manuscript. The study is well-designed, the text is clear and pleasant to read and the figures are easy to understand. Moreover, the work is extremely well-documented, with most of the study details provided in Supplementary Tables, while data and scripts are made available in a public repository (please note that I did not download the latter to check the actual content). Consequently, my comments are minor and aimed at further clarifying the text when needed. However, I noticed a number of small errors in the reporting of the results. As some of them are quite confusing, I insist that they should be addressed in the revision of the manuscript.

* Scientific questions
- lines 173-175: I don't understand if the 101 assemblies of the paper (which are taxonomically diverse) are part of the 129 assemblies on the NCBI portal and, if not, why the former were not preferred to the latter? Was there some global quality assessment of all available assemblies (in the NCBI and elsewhere) prior to taking these decisions?
- lines 197-198 ("widespread lateral gene transfer from Wolbachia"): this raises the issue of whether such transfers should be considered as contamination in this species... and in other species! On a side note, had the species datasets completely devoid of Wolbachia sequences been aggressively curated before public release?
- line 209 ("after filtering out contaminating sequences"): if I understand correctly, Kraken2 was used on whole contigs, not pseudo-reads spliced out of contigs. Then does "filtering out" mean removing these whole contigs (i.e., up to 1.4 Mb in one case)? Was it not possible to preserve more information by only masking the foreign regions of large contigs (assuming they might be chimeric)?
- lines 213-216: it is mentioned briefly in the Discussion (lines 766-769 and 793-795), but I wonder if "pangenomes" (rather than single strains) would have provided more sensitivity for pathogen and commensal screening. This is an important issue from a practical point of view.
- lines 243-244 ("including data on 12 of the 29 target species"): is it on purpose that 17 of the target species are not tested by the samples?
- in Table 1: I know that it is suggested in the CLARK paper, but I wonder if the representation of some species by multiple assemblies is really harmless in terms of assignment statistics. Similarly, are we sure that the results are not biased in some way when some species are more distant and would thus provide many more specific k-mers than groups of highly related species? I did not find a discussion of this issue in the CLARK paper, but for the present purposes, knowing the answer would be important. If so, it might be introduced at lines 298-300. Also, a related bit of discussion appears at lines 665-677.
- lines 478-494: for the 16 species not represented in the target dictionaries but still assigned to a single target species, 5 are assigned to D. bipectinata (and none to D. ananassae) and 2 to D. obscura (and none to D. subobscura). Is there a phylogenetic reason for this?

* Clarity issues
- lines 16-31 in the Abstract are a copy-paste from the end of the Introduction (lines 137-152); maybe rephrase some sentences?
- lines 57-58 ("the characteristics mentioned above have mostly remained"): I don't understand the idea here; please rephrase.
- lines 87-88 ("but they are not well suited for the analysis of large amounts of samples"): please add a hint about why it is so.
- line 91 ("the genomes of the putative contaminant species"): this is a bit restrictive (only negative filtering), especially considering that the current study uses both positive and negative filtering; please add a bit of nuance. BTW, positive filtering is discussed at lines 700-704.
- lines 159-160: please explain the logic behind the phylogenetic breadth of the reference sampling to help others (e.g., why also the subgenus Drosophila).
- lines 161-163 ("for subgroups or groups represented by multiple assemblies, only one species was selected"): ambiguous phrasing: multiple assemblies of the same species or multiple assemblies of different species? In my view, one assembly does not always equate to one species.
- line 186 ("including Wolbachia endosymbionts"): ambiguous wording; is it an exception or a precision?
- lines 230-232 ("Building the k–mer dictionary took 2h46min"): such timings are quite useless without some idea of the CPU architecture; please specify it.
- line 307 (and around): in the CLARK paper, the confidence score is only computed based on the two top-matching sequences, not all; please check.
- lines 349-350 ("sequence length was representative of typical short read datasets"): please state that datasets here include a variable mixture of merged and unmerged reads (if I understand correctly).
- line 379 ("averaging 24.5%"): why report a mean here but median values everywhere else? Is there a specific reason?
- Tables S4/S5 (and lines 414-415): "assignable (and assigned) sequences" should be better defined (see also my comment below for line 325). "% assigned sequences (with at least one matching kmer)" in the header of Col E is confusing because either a) it should complement Col D "% seq with no matching kmer" [since a sequence has either zero or at least one matching k-mer (= assignable?)] or b) Col E actually reports the fraction of assignable sequences that are assigned (at >=5/6 and >=0.95 thresholds?). Please clarify.
- in legend of Figure 2 ("corresponding target dictionary"): why "corresponding" here? There is only one global dictionary per method, correct?
- lines 503-505 ("capture less than 30% of the assigned sequences"): the text does not exactly match what is shown in Figure S4 (rather 40% for Doshi, Dprui and Dbock while Dcard is not cited). Why such a discrepancy with Figure 3?
- lines 527-529: if I count correctly, 5 Ind-Seq samples are not mentioned in this part (236-215-16 = 5). Four of them are cited when discussing Wolbachia contamination, but not the last one: US-Nc2_CF1. Anything to say about it?
- lines 708-710 (about filtering based on k-mers): I agree with the assertion, but it seems ironic that target contigs were filtered with Kraken2 in the present study. This should be explicitly acknowledged here to avoid that impression.
- lines 732-737 (about contaminated Pool-Seq samples): was this issue known prior to the current study? If not, it would be useful to say so.
- legend of Figure S2: why "Total assignment time"? I guess it includes sample loading time, but this is not mentioned in the main text. Is that what this means?
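
To make my remark on line 307 concrete, here is a minimal sketch of the two readings of the confidence score. It assumes, as I understand the CLARK paper, that the score is h1/(h1+h2), where h1 and h2 are the hit counts of the two top-matching targets. The function names and the example hit counts are hypothetical, and this is not CLARK's own code.

```python
def confidence_top_two(hit_counts):
    """Score from the two best targets only: h1 / (h1 + h2)."""
    ranked = sorted(hit_counts.values(), reverse=True)
    h1 = ranked[0] if ranked else 0
    h2 = ranked[1] if len(ranked) > 1 else 0
    return h1 / (h1 + h2) if h1 + h2 > 0 else None


def confidence_all_targets(hit_counts):
    """Alternative reading: best target over the hits to all targets."""
    total = sum(hit_counts.values())
    return max(hit_counts.values()) / total if total > 0 else None


# Hypothetical k-mer hit counts for one query sequence (not real data).
hits = {"Dsuz": 18, "Dsubp": 5, "Dbia": 2}
top_two = confidence_top_two(hits)          # 18 / (18 + 5)
all_targets = confidence_all_targets(hits)  # 18 / (18 + 5 + 2)
```

The two readings diverge as soon as a third target receives hits, so the wording in the text matters for any threshold applied to the score.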

* Mild suggestions
- line 142 (and elsewhere): were assigned => were re-assigned [to emphasize the original assignment error?]
- lines 184,189-190: choose between "contaminating" and "contaminated"? In the present case, they are used interchangeably and this might be confusing.
- Figure 1: why two species names in bold? Besides, for consistency with, e.g., willistoni, I would add the subgroups quinaria and virilis in the figure (especially because "subgroup virilis" is used in the text).
- lines 494-495: these samples => these 16 samples [for clarity and maybe it would be useful to color them differently in Figure S4]
- lines 499-500: the most represented species => the most closely related represented species [also check y axis in Figure S4].
- line 539: 1. 71% => 1.71%
- Figure S1B (y axis): %% overlapping => % overlapping

* Reporting errors
- line 6: 32 => 22
- line 114: n=8 => n=6
- in the Excel file, Tables S2 and S3 are reversed
- line 250: n=3 => n=4 [and "missing Illumina HiSeq X Ten (PE150) (n=1)"]
- line 261 ("all sequenced on an Illumina HiSeq4000 in PE150 mode") => except 30 samples sequenced in PE100
- Tables S2/S3: column headers for timing values are incorrect, which makes the section about run times extremely confusing; please check and fix! Moreover, the header of the column for overlapping values has the word "Non" in it, which (wrongly) suggests that these numbers are "non-overlapping reads" (see also line 281 in main text).
- line 325 (and elsewhere): I am not sure that it is an error, but to me, >1 and >5 mean "at least two" and "at least six", respectively. Is that what is meant here? The issue is important because the section about the proportion of assigned sequences is difficult to understand with this doubt in mind (see comment above).
- Figure 2B: I am pretty sure that there is an error in the order of the first two violin plots. Target sp and Other sp are probably reversed because, as such, they neither match the text (lines 389-398) nor Figure 2A. Color key is right though. Please check and fix!
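
The ambiguity I raise for line 325 can be shown with a minimal sketch. The function name, its parameters and the example values are hypothetical. The only point is that a strict (>) and an inclusive (>=) reading of the same threshold classify a borderline sequence differently.

```python
def passes_kmer_threshold(n_matching_kmers, threshold, strict=True):
    """Return True if the k-mer count passes under the chosen reading."""
    if strict:
        return n_matching_kmers > threshold   # ">5" means at least six
    return n_matching_kmers >= threshold      # ">=5" means at least five


# A sequence with exactly five matching k-mers is the borderline case.
strict_reading = passes_kmer_threshold(5, 5, strict=True)
inclusive_reading = passes_kmer_threshold(5, 5, strict=False)
```

Under the strict reading the borderline sequence is left unassigned, under the inclusive reading it is assigned, so the reported proportions of assigned sequences depend on which one is meant.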

https://doi.org/10.24072/pci.genomics.100244.rev12
