Close printable page

Recommendation

A quick alternative method for resolving bacterial taxonomy using short identical DNA sequences in genomes or metagenomes

B. Jesse Shapiro based on reviews by Gavin Douglas and 1 anonymous reviewer

A recommendation of:

A rapid and simple method for assessing and representing genome sequence relatedness

M Briand, M Bouzid, G Hunault, M Legeay, M Fischer-Le Saux, M Barret (2020), bioRxiv, 569640, ver. 5 peer-reviewed and recommended by Peer Community In Genomics https://doi.org/10.1101/569640

Read preprint in preprint server Now published in Peer Community Journal

Data used for results

Scripts used to obtain or analyze results

Abstract

EN

AR

ES

FR

HI

JA

PT

RU

ZH-CN

A rapid and simple method for assessing and representing genome sequence relatedness

Coherent genomic groups are frequently used as a proxy for bacterial species delineation through computation of overall genome relatedness indices (OGRI). Average nucleotide identity (ANI) is a widely employed method for estimating relatedness between genomic sequences. However, pairwise comparisons of genome sequences based on ANI is relatively computationally intensive and therefore precludes analyses of large datasets composed of thousands of genome sequences. In this work we proposed a workflow to compute and visualize relationships between genomic sequences. A dataset containing more than 3,500 *Pseudomonas* genome sequences was successfully classified with an alternative OGRI based on *k*-mer counts in few hours with the same precision as ANI. A new visualization method based on zoomable circle packing was employed for assessing relationships among the 350 groups generated. Amendment of databases with these *Pseudomonas* groups greatly improved the classification of metagenomic read sets with *k*-mer-based classifier. The developed workflow was integrated in the user-friendly KI-S tool that is available at the following address: https://iris.angers.inra.fr/galaxypub-cfbp.

ANI, k -mers, genome sequence relatedness, similarity matrix representation, circle packing, Pseudomonas , metagenome

This is an automatically generated version. The authors and PCI decline all responsibility concerning its content

طريقة سريعة وبسيطة لتقييم وتمثيل ارتباط تسلسل الجينوم

تُستخدم المجموعات الجينومية المتماسكة بشكل متكرر كبديل لتحديد الأنواع البكتيرية من خلال حساب مؤشرات ارتباط الجينوم الشاملة (OGRI). متوسط هوية النوكليوتيدات (ANI) هو طريقة مستخدمة على نطاق واسع لتقدير العلاقة بين التسلسلات الجينومية. ومع ذلك، فإن المقارنات الزوجية لتسلسلات الجينوم بناءً على ANI تكون مكثفة حسابيًا نسبيًا وبالتالي تمنع تحليل مجموعات البيانات الكبيرة المكونة من آلاف تسلسلات الجينوم. في هذا العمل اقترحنا سير عمل لحساب وتصور العلاقات بين التسلسلات الجينومية. تم تصنيف مجموعة بيانات تحتوي على أكثر من 3500 تسلسل جينوم *Pseudomonas* بنجاح باستخدام OGRI بديل يعتمد على أعداد *k*-mer في ساعات قليلة بنفس دقة ANI. تم استخدام طريقة تصور جديدة تعتمد على التعبئة الدائرية القابلة للتكبير/التصغير لتقييم العلاقات بين المجموعات الـ 350 التي تم إنشاؤها. أدى تعديل قواعد البيانات باستخدام مجموعات *Pseudomonas* هذه إلى تحسين تصنيف مجموعات القراءة الميتاجينومية بشكل كبير باستخدام المصنف المستند إلى *k*-mer. تم دمج سير العمل الذي تم تطويره في أداة KI-S سهلة الاستخدام والمتوفرة على العنوان التالي: https://iris.angers.inra.fr/galaxypub-cfbp.

ANI، k-mers، ارتباط تسلسل الجينوم، تمثيل مصفوفة التشابه، التعبئة الدائرية، الزائفة، الميتاجينوم

This is an automatically generated version. The authors and PCI decline all responsibility concerning its content

Un método rápido y sencillo para evaluar y representar la relación de la secuencia del genoma

Los grupos genómicos coherentes se utilizan con frecuencia como indicador para la delimitación de especies bacterianas mediante el cálculo de índices generales de relación del genoma (OGRI). La identidad promedio de nucleótidos (ANI) es un método ampliamente utilizado para estimar la relación entre secuencias genómicas. Sin embargo, las comparaciones por pares de secuencias del genoma basadas en ANI requieren un uso computacional relativamente intensivo y, por lo tanto, excluyen el análisis de grandes conjuntos de datos compuestos por miles de secuencias del genoma. En este trabajo propusimos un flujo de trabajo para calcular y visualizar relaciones entre secuencias genómicas. Un conjunto de datos que contiene más de 3500 secuencias del genoma de *Pseudomonas* se clasificó con éxito con un OGRI alternativo basado en recuentos de *k*-meros en pocas horas con la misma precisión que el ANI. Se empleó un nuevo método de visualización basado en el empaquetado de círculos ampliables para evaluar las relaciones entre los 350 grupos generados. La modificación de las bases de datos con estos grupos de *Pseudomonas* mejoró enormemente la clasificación de conjuntos de lecturas metagenómicas con un clasificador basado en *k*-mer. El flujo de trabajo desarrollado se integró en la herramienta KI-S fácil de usar que está disponible en la siguiente dirección: https://iris.angers.inra.fr/galaxypub-cfbp.

ANI, k -mers, relación de secuencia del genoma, representación de matriz de similitud, empaquetamiento de círculos, Pseudomonas, metagenoma

This is an automatically generated version. The authors and PCI decline all responsibility concerning its content

Une méthode rapide et simple pour évaluer et représenter la relation entre les séquences du génome

Les groupes génomiques cohérents sont fréquemment utilisés comme proxy pour la délimitation des espèces bactériennes grâce au calcul des indices globaux de parenté du génome (OGRI). L'identité moyenne des nucléotides (ANI) est une méthode largement utilisée pour estimer la parenté entre les séquences génomiques. Cependant, les comparaisons par paires de séquences génomiques basées sur l’ANI nécessitent des calculs relativement intensifs et excluent donc les analyses de grands ensembles de données composés de milliers de séquences génomiques. Dans ce travail, nous avons proposé un flux de travail pour calculer et visualiser les relations entre les séquences génomiques. Un ensemble de données contenant plus de 3 500 séquences génomiques de *Pseudomonas* a été classé avec succès avec un OGRI alternatif basé sur les décomptes de *k*-mer en quelques heures avec la même précision que l'ANI. Une nouvelle méthode de visualisation basée sur le regroupement de cercles zoomables a été utilisée pour évaluer les relations entre les 350 groupes générés. La modification des bases de données avec ces groupes *Pseudomonas* a grandement amélioré la classification des ensembles de lectures métagénomiques avec un classificateur basé sur *k*-mer. Le workflow développé a été intégré dans l'outil convivial KI-S disponible à l'adresse suivante : https://iris.angers.inra.fr/galaxypub-cfbp.

ANI, k -mers, parenté des séquences génomiques, représentation matricielle de similarité, empilement de cercles, Pseudomonas, métagénome

This is an automatically generated version. The authors and PCI decline all responsibility concerning its content

जीनोम अनुक्रम संबंधीता का आकलन और प्रतिनिधित्व करने के लिए एक तीव्र और सरल विधि

समग्र जीनोम संबंधित सूचकांक (ओजीआरआई) की गणना के माध्यम से सुसंगत जीनोमिक समूहों को अक्सर जीवाणु प्रजातियों के चित्रण के लिए प्रॉक्सी के रूप में उपयोग किया जाता है। औसत न्यूक्लियोटाइड पहचान (एएनआई) जीनोमिक अनुक्रमों के बीच संबंधितता का आकलन करने के लिए एक व्यापक रूप से नियोजित विधि है। हालाँकि, एएनआई पर आधारित जीनोम अनुक्रमों की जोड़ीवार तुलना अपेक्षाकृत कम्प्यूटेशनल रूप से गहन है और इसलिए हजारों जीनोम अनुक्रमों से बने बड़े डेटासेट के विश्लेषण को रोकती है। इस कार्य में हमने जीनोमिक अनुक्रमों के बीच संबंधों की गणना और कल्पना करने के लिए एक वर्कफ़्लो प्रस्तावित किया है। 3,500 से अधिक *स्यूडोमोनास* जीनोम अनुक्रमों वाले डेटासेट को एएनआई के समान सटीकता के साथ कुछ घंटों में *k*-मेर गणना के आधार पर वैकल्पिक ओजीआरआई के साथ सफलतापूर्वक वर्गीकृत किया गया था। उत्पन्न 350 समूहों के बीच संबंधों का आकलन करने के लिए ज़ूम करने योग्य सर्कल पैकिंग पर आधारित एक नई विज़ुअलाइज़ेशन पद्धति को नियोजित किया गया था। इन *स्यूडोमोनास* समूहों के साथ डेटाबेस में संशोधन से *k*-mer-आधारित क्लासिफायरियर के साथ मेटागेनोमिक रीड सेट के वर्गीकरण में काफी सुधार हुआ। विकसित वर्कफ़्लो को उपयोगकर्ता के अनुकूल KI-S टूल में एकीकृत किया गया था जो निम्नलिखित पते पर उपलब्ध है: https://iris.angers.inra.fr/galaxypub-cfbp।

एएनआई, के-मर्स, जीनोम अनुक्रम संबंधितता, समानता मैट्रिक्स प्रतिनिधित्व, सर्कल पैकिंग, स्यूडोमोनास, मेटागेनोम

This is an automatically generated version. The authors and PCI decline all responsibility concerning its content

ゲノム配列の関連性を評価および表現するための迅速かつ簡単な方法

一貫したゲノムグループは、全体的なゲノム関連性指数 (OGRI) の計算を通じて細菌種を描写するための代用としてよく使用されます。平均ヌクレオチド同一性 (ANI) は、ゲノム配列間の関連性を推定するために広く使用されている方法です。ただし、ANI に基づくゲノム配列のペアごとの比較は比較的計算量が多いため、数千のゲノム配列で構成される大規模なデータセットの分析は不可能になります。この研究では、ゲノム配列間の関係を計算して視覚化するワークフローを提案しました。 3,500 を超える *シュードモナス* ゲノム配列を含むデータセットは、*k*-mer 数に基づく代替 OGRI を使用して、ANI と同じ精度で数時間で正常に分類されました。生成された 350 のグループ間の関係を評価するために、ズーム可能なサークルパッキングに基づく新しい視覚化方法が採用されました。これらの *Pseudomonas* グループによるデータベースの修正により、*k*-mer ベースの分類器によるメタゲノムリードセットの分類が大幅に改善されました。開発されたワークフローは、次のアドレスで入手できる使いやすい KI-S ツールに統合されました: https://iris.angers.inra.fr/galaxypub-cfbp。

ANI、k-mer、ゲノム配列関連性、類似性行列表現、サークルパッキング、シュードモナス、メタゲノム

This is an automatically generated version. The authors and PCI decline all responsibility concerning its content

Um método rápido e simples para avaliar e representar a relação da sequência do genoma

Grupos genômicos coerentes são frequentemente usados como proxy para o delineamento de espécies bacterianas através do cálculo de índices gerais de relacionamento do genoma (OGRI). A identidade média de nucleotídeos (ANI) é um método amplamente utilizado para estimar o parentesco entre sequências genômicas. No entanto, comparações pareadas de sequências genômicas baseadas em ANI são relativamente intensivas em termos computacionais e, portanto, impedem análises de grandes conjuntos de dados compostos por milhares de sequências genômicas. Neste trabalho propusemos um fluxo de trabalho para calcular e visualizar relações entre sequências genômicas. Um conjunto de dados contendo mais de 3.500 sequências do genoma de *Pseudomonas* foi classificado com sucesso com um OGRI alternativo baseado em contagens de *k*-mer em poucas horas com a mesma precisão do ANI. Um novo método de visualização baseado em empacotamento circular com zoom foi empregado para avaliar as relações entre os 350 grupos gerados. A alteração de bancos de dados com esses grupos *Pseudomonas* melhorou muito a classificação de conjuntos de leitura metagenômicos com classificador baseado em *k*-mer. O fluxo de trabalho desenvolvido foi integrado na ferramenta KI-S de fácil utilização que está disponível no seguinte endereço: https://iris.angers.inra.fr/galaxypub-cfbp.

ANI, k -mers, relação de sequência genômica, representação de matriz de similaridade, empacotamento circular, Pseudomonas, metagenoma

This is an automatically generated version. The authors and PCI decline all responsibility concerning its content

Быстрый и простой метод оценки и представления родства последовательностей генома.

Связные геномные группы часто используются в качестве показателя для определения видов бактерий путем расчета общих индексов родства геномов (OGRI). Средняя идентичность нуклеотидов (ANI) — широко используемый метод оценки родства между геномными последовательностями. Однако парное сравнение последовательностей генома на основе ANI требует относительно больших вычислительных ресурсов и поэтому исключает анализ больших наборов данных, состоящих из тысяч последовательностей генома. В этой работе мы предложили рабочий процесс для расчета и визуализации связей между геномными последовательностями. Набор данных, содержащий более 3500 последовательностей генома *Pseudomonas*, был успешно классифицирован с помощью альтернативного OGRI на основе подсчета *k*-меров за несколько часов с той же точностью, что и ANI. Для оценки связей между 350 созданными группами был использован новый метод визуализации, основанный на масштабируемой упаковке кругов. Внесение в базы данных этих групп *Pseudomonas* значительно улучшило классификацию наборов метагеномных прочтений с помощью классификатора на основе *k*-меров. Разработанный рабочий процесс был интегрирован в удобный инструмент KI-S, доступный по следующему адресу: https://iris.angers.inra.fr/galaxypub-cfbp.

ANI, k-меры, родство последовательностей геномов, представление матрицы сходства, упаковка кругов, Pseudomonas, метагеном

This is an automatically generated version. The authors and PCI decline all responsibility concerning its content

一种快速、简单的评估和表示基因组序列相关性的方法

通过计算整体基因组相关性指数 (OGRI)，一致的基因组群经常被用作细菌物种描述的代理。平均核苷酸同一性（ANI）是一种广泛采用的估计基因组序列之间相关性的方法。然而，基于 ANI 的基因组序列的成对比较计算量相对较大，因此无法对由数千个基因组序列组成的大型数据集进行分析。在这项工作中，我们提出了一个计算和可视化基因组序列之间关系的工作流程。包含超过 3,500 个*假单胞菌*基因组序列的数据集已在几个小时内使用基于 *k*-mer 计数的替代 OGRI 成功分类，其精度与 ANI 相同。采用基于可缩放圆堆积的新可视化方法来评估生成的 350 个组之间的关系。使用这些*假单胞菌*组对数据库进行的修改极大地改进了基于*k*-mer的分类器对宏基因组读取集的分类。开发的工作流程已集成到用户友好的 KI-S 工具中，该工具可从以下地址获取：https://iris.angers.inra.fr/galaxypub-cfbp。

ANI、k-mers、基因组序列相关性、相似性矩阵表示、圆形堆积、假单胞菌、宏基因组

Submission: posted 07 November 2019
Recommendation: posted 15 September 2020, validated 24 September 2020

Cite this recommendation as:
Shapiro, B. (2020) A quick alternative method for resolving bacterial taxonomy using short identical DNA sequences in genomes or metagenomes. Peer Community in Genomics, 100001. https://doi.org/10.24072/pci.genomics.100001

Recommendation

The bacterial species problem can be summarized as follows: bacteria recombine too little, and yet too much (Shapiro 2019).
Too little in the sense that recombination is not obligately coupled with reproduction, as in sexual eukaryotes. So the Biological Species Concept (BSC) of reproductive isolation does not strictly apply to clonally reproducing organisms like bacteria. Too much in the sense that genetic exchange can occur promiscuously across species (or even Domains), potentially obscuring species boundaries.
In parallel to such theoretical considerations, several research groups have taken more pragmatic approaches to defining bacterial species based on sequence similarity cutoffs, such as genome-wide average nucleotide identity (ANI). At a cutoff above 95% ANI, genomes are considered to come from the same species. While this cutoff may appear arbitrary, a discontinuity around 95% in the distribution of ANI values has been argued to provide a 'natural' cutoff (Jain et al. 2018). This discontinuity has been criticized as being an artefact of various biases in genome databases (Murray, Gao, and Wu 2020), but appears to be a general feature of relatively unbiased metagenome-assembled genomes as well (Olm et al. 2020). The 95% cutoff has been suggested to represent a barrier to homologous recombination (Olm et al. 2020), although clusters of genetic exchange consistent with BSC-like species are observed at much finer identity cutoffs (Shapiro 2019; Arevalo et al. 2019).
Although 95% ANI is the most widely used genomic standard for species delimitation, it is by no means the only plausible approach. In particular, tracts of identical DNA provide evidence for recent genetic exchange, which in turn helps define BSC-like clusters of genomes (Arevalo et al. 2019). In this spirit, Briand et al. (2020) introduce a genome-clustering method based on the number of shared identical DNA sequences of length k (or k-mers). Using a test dataset of Pseudomonas genomes, they find that 95% ANI corresponds to approximately 50% of shared 15-mers. Applying this cutoff yields 350 Pseudomonas species, whereas the current taxonomy only includes 207 recognized species. To determine whether splitting the genus into a greater number of species is at all useful, they compare their new classification scheme to the traditional one in terms of the ability to taxonomically classify metagenomic sequencing reads from three Pseudomonas-rich environments. In all cases, the new scheme (termed K-IS for "Kinship relationships Identification with Shared k-mers") yielded a higher number of classified reads, with an average improvement of 1.4-fold. This is important because increasing the number of genome sequences in a reference database – without consistent taxonomic annotation of these genomes – paradoxically leads to fewer classified metagenomic reads. Thus a rapid, automated taxonomy such as the one proposed here offers an opportunity to more fully harness the information from metagenomes.
KI-S is also fast to run, so it is feasible to test several values of k and quickly visualize the clustering using an interactive, zoomable circle-packing display (that resembles a cross-section of densely packed, three-dimensional dendrogram). This interface allows the rapid flagging of misidentified species, or understudied species with few sequenced representatives as targets for future study. Hopefully these initial Pseudomonas results will inspire future studies to apply the method to additional taxa, and to further characterize the relationship between ANI and shared identical k-mers. Ultimately, I hope that such investigations will resolve the issue of whether or not there is a 'natural' discontinuity for bacterial species, and what evolutionary forces maintain this cutoff.

References

Arevalo P, VanInsberghe D, Elsherbini J, Gore J, Polz MF (2019) A Reverse Ecology Approach Based on a Biological Definition of Microbial Populations. Cell, 178, 820-834.e14. https://doi.org/10.1016/j.cell.2019.06.033

Briand M, Bouzid M, Hunault G, Legeay M, Saux MF-L, Barret M (2020) A rapid and simple method for assessing and representing genome sequence relatedness. bioRxiv, 569640, ver. 5 peer-reveiwed and recommended by PCI Genomics. https://doi.org/10.1101/569640

Jain C, Rodriguez-R LM, Phillippy AM, Konstantinidis KT, Aluru S (2018) High throughput ANI analysis of 90K prokaryotic genomes reveals clear species boundaries. Nature Communications, 9, 5114. https://doi.org/10.1038/s41467-018-07641-9

Murray CS, Gao Y, Wu M (2020) There is no evidence of a universal genetic boundary among microbial species. bioRxiv, 2020.07.27.223511. https://doi.org/10.1101/2020.07.27.223511

Olm MR, Crits-Christoph A, Diamond S, Lavy A, Carnevali PBM, Banfield JF (2020) Consistent Metagenome-Derived Metrics Verify and Delineate Bacterial Species Boundaries. mSystems, 5. https://doi.org/10.1128/mSystems.00731-19

Shapiro BJ (2019) What Microbial Population Genomics Has Taught Us About Speciation. In: Population Genomics: Microorganisms Population Genomics. (eds Polz MF, Rajora OP), pp. 31–47. Springer International Publishing, Cham. https://doi.org/10.1007/13836201810

PDF recommendation

Conflict of interest:
The recommender in charge of the evaluation of the article and the reviewers declared that they have no conflict of interest (as defined in the code of conduct of PCI) with the authors or with the content of the article. The authors declared that they comply with the PCI rule of having no financial conflicts of interest in relation to the content of the article.

Reviews

Reviewed by Gavin Douglas, 25 Aug 2020

The authors have addressed all of my comments - thanks.

https://doi.org/10.24072/pci.genomics.100001.rev31

Evaluation round #2

DOI or URL of the preprint: https://doi.org/10.1101/569640

Version of the preprint: 3

Author's Reply, 21 Aug 2020

Download author's reply Download tracked changes file https://doi.org/10.24072/pci.genomics.100006.ar2

Decision by B. Jesse Shapiro, posted 03 Aug 2020

Both reviewers appreciated the improvements made to this revised manuscripts, but suggested a few remaining revisions before the manuscript can be recommended. I trust you will be able to address these comments without too much additional work, since they mainly involve clarification and software availability.

https://doi.org/10.24072/pci.genomics.100006.d2

Reviewed by anonymous reviewer 1, 31 Jul 2020

Overall I believe that this manuscript has improved, and the authors are more fair in their comparison of modern methods. Other choices for benchmarking could have been chosen, but the authors have justified their tests.

My main remaining issue is on the availability of the software. While the source code is now available (after some effort, but available nonetheless), the galaxy wrapper is not. I would urge the authors to either open access to the galaxy instance linked, or put their code on the galaxy shed so that other galaxy users can most easily benefit from their work.

https://doi.org/10.24072/pci.genomics.100006.rev21

Reviewed by Gavin Douglas, 16 Jul 2020

The authors have addressed most of my concerns, but there are two important points I would like to see addressed to improve the manuscript.

Major comments:

First, thanks for expanding on the CLARK analysis in your response, but I still have a few concerns. Does the % classified reads that you refer to refer to just at the species level or in general? It would make sense if it corresponded to just the species level based on your explanation. In either case, a clearer discussion of this result is needed. From my understanding, this result is a proof-of-concept that clustering genomes into clusters before running taxonomic assignment can improve classification. It could be argued that this is circular because the clusters are based on shared k-mers, which is also what CLARK bases classification on: if taxa are defined based on shared k-mers then it would always be expected for a k-mer-based classification approach to classify taxa with more resolution. I think the authors should emphasize this k-mer connection between CLARK and the clustering approach and also clearly state that they can only hypothesize that a higher proportion of reads are being correctly classified with their workflow (and that it hasn't actually been demonstrated). Currently the authors discuss this result as showing a clear benefit to microbial ecology in general, which I think would be very misleading to readers.

Secondly, on page 10 the authors state: “Moreover, KI-S includes a friendly visualization interface that could help systematicians to curate whole genome databases.”. I was able to get access to the authors’ galaxy server to try out the tool thanks to their quick reply to my email and I found it straight-forward to use. However, it wasn’t clear to me whether any reader in general would be able to get an account on this server. Based on advertising the link in the manuscript I’m guessing this is true, but this should be clarified either way. If not, then users will need more documentation on how they can use the KI-S code to setup the visualization workflow themselves. I did not find it straight-forward to download the source code and it looks like the only documentation for the source code (the README.md file in the GitHub repository) is in French, which would need to be translated for an English-reading audience. Specifically, looking into this README it appears that the key circle packing visualization step is performed by the generate_packing.pl Perl script. Details on how to prepare the input and look at the output of this script is needed.

Minor comments:

Hierarchical clustering is mentioned later on, but it would help readers evaluate the method to know the specific details on how this clustering was performed with the custom R script when it is described in the methods.

Minor typo on Pg 4: “was first evaluate” should be “was first evaluated”

https://doi.org/10.24072/pci.genomics.100006.rev22

Evaluation round #1

DOI or URL of the preprint: https://doi.org/10.1101/569640

Version of the preprint: 2

Author's Reply, 07 Jul 2020

Download author's reply https://doi.org/10.24072/pci.genomics.100006.ar1

Decision by B. Jesse Shapiro, posted 19 Dec 2019

The manuscript has now been seen by two reviewers, who both see potential in the work but both raised concerns about precisely what the new method brings, and how it compares to other methods (e.g. FastANI). Perhaps the major contribution of the new method lies in the visualization, in which case this part should be expanded. The reviewers also have several specific comments that should be addressed in a revised manuscript.

https://doi.org/10.24072/pci.genomics.100006.d1

Reviewed by anonymous reviewer 1, 25 Nov 2019

In this short manuscript Briand et al describe a workflow which uses k-mer indexing software to compare bacterial genomes. This method generates a similarity measure which is comparable to ANI. They go on to use these relatedness measures to cluster genomes at various thresholds, produce a visualisation of these clusters, and test the use of these clusters in metagenomic read classification. This workflow is deployed on a galaxy server.

Overall the methods in the manuscript appear to be sound, as they are mostly based on previously published work. Though the novelty of the algorithm is limited, the implementation and pipeline, being on galaxy, may well be useful to researchers who are more comfortable with a graphical user interface than the command line. This server requires username and password to use, so I was unable to test any of this software myself. Nor was the implementation available on github (or similar), or the galaxy shed, meaning no-one else can use it. This severely limited my ability to review this aspect of the manuscript.

I also had a number of serious issues with the presentation of the work: 1) The comparison with PYANI is not really appropriate. The authors used Simka, which by my understanding is a k-mer indexing package, so is unsurprisingly orders of magnitude faster than nucleotide alignment with mummer and blast. A more modern comparison would be with either other k-mer indexes, or sketch based approaches such as fastANI. These approaches have been around for a number of years, and are the standard now used in this field. 2) There is not enough description of the methods, and code is also needed. Describing briefly how components work, what they do and why parameters values were chosen all need to be added. I was not able to find information on simka without following references, and this is an integral part of the method. The difference in how simka and other potential methods work needs to be explained, and why this is expected to lead to large differences in computation time. Likewise, the section on metagenomic read sets needs further description. (What is Clark and how does it work? Why does adding further classifications in helps classify more reads?) 3) The results also lack context. It was difficult to understand what problems were being solved by the presented method, and how much of the method is new compared to e.g. fastANI. The first section of the results was mostly a technical methods description, finding similar results on k-mer size as has been previously reported. How did the original Clark database and the newly assigned genome sequences differ in classification, and why exactly did this change the number of reads that could be assigned? How does this relate to the broader issue of misclassification and missing identifiers in RefSeq, which has been noted previously (https://genomebiology.biomedcentral.com/articles/10.1186/s13059-018-1554-6)? More explicit example use cases could be added. In figure 3, describing how to read the figure and e.g. identify misclassiciations would be useful. 4) More care needs to be taken with some of the species and genus name terminology. Particularly, the words 'strain' and 'clique' kept appearing without definition. How do these terms relate to species and genus level differences? 5) Why was the Pseudomonas dataset used? What was the original species classifications breakdown, and how (quantitatively) did this compare with the reclassification? Is this one example sufficient? Other fast distance estimators have been run on all of RefSeq.

https://doi.org/10.24072/pci.genomics.100006.rev11

Reviewed by Gavin Douglas, 22 Nov 2019

Briand et al. describe a new approach for computing inter-genome relatedness based on the percentage of shared kmers. The main motivation for this project is that the computation time for computing many relatedness metrics, like the average nucleotide identity, can be prohibitive for many pairwise genome comparisons. I think this tool could be valuable for the field, especially after addressing a few issues which I think currently make the benefits of the tool difficult to evaluate (see below). In particular, I think the authors’ tool could be great for running quality control on taxonomy assignments in genome databases. This quality control can be run using an interactive approach for visualizing genome relatedness that the authors have implemented, which I think could be used for quickly spotting problematic taxonomic assignments.

Major comments

I think it would be important to clarify how the results of KI-S clustering in practice differ from other similar tools. One tool in particular is Mash, which was published in 2016. This tool can be used to rapidly compute distances between genomes after performing dimension reduction based on kmer counts. The motivation for Mash was to speed up calculations of inter-genome (and sequences in general) distances. In the Mash paper the authors describe their approach as comparable to ANI while being much faster and so I think it would be important to directly compare to Mash in terms of both the compute time and results. If the authors don’t agree that Mash is a comparable tool then this should be explained.

A related issue is that it currently is not clear whether using the % of shared kmers results in comparable genome clusters to existing approaches like average nucleotide identity. It seems like this would likely be the case, but I think this is important for the authors to clearly describe either way so that users can better evaluate the tool.

I also do not agree that they have shown evidence that their approach can be used to improve the taxonomic classification of metagenomics samples. This analysis focused on the percentage of classified reads, which cannot alone be used to evaluate how well a taxonomic classification performed.

Other comments

The way KI-S is mentioned in the abstract makes it seems like it is a pre-existing tool, but on page 5 it sounds like the authors developed it from scratch – this should be clarified. Also, it is unclear whether all the steps like running Simka and the custom R script are run by KI-S itself. Lastly, it would be good to state what KI-S stands for, which I may have missed.
P3,L45 – I recommend re-wording to make the first few sentences of the Background a little clearer. In particular, it reads like specifically Bacteria vs Archaea are the taxonomic groups being delineated, rather than prokaryotic species in general
Figure 1 – Axis labels are needed, which might be easiest to do if fewer panels were shown. In particular, it seems like K15-K20 are extremely similar so maybe a couple could be removed. It is also not clear to me from the figure legend what “the number of values by class in the subset of 934 Pseudomonas genomic comparison” refers to on the y-axis. I think this is the ANI / % shared kmers for every pairwise comparison of Pseudomonas genomes, but I think this could be clarified either way.
When describing the % overlapping species in each peak in figure 1 – how were the cut-offs for which data points to include in each peak decided (e.g. what cut-offs of % shared kmers were used to call data points in peak 2?)
P7 – The authors imply that using 15-mers is the best or at least equally good as higher kmer values. This decision is discussed in the discussion, but I think it would be useful to explicitly mention this decision here (esp. when contrasting the 15-mer and 20-mer comparisons for instance) – perhaps at the end of paragraph 1 of the results.
P7,L132 – I think the paragraph starting with “Fifty percent of 15-mers is close to ANIb value of 0.95” would benefit by making it clear what the goal of these analyses were, possibly with something like this: “We next investigated what percentage of shared kmers corresponds to an ANIb value of 0.95, which is a common cut-off for delineating species”.
P7,L145-147 – I am not sure what the sentence starting with “In addition, 15-mers allows the investigation of inter and intra-specific…” refers to and I think this should be clarified. One possible way to make the authors’ point clearer might be to contrast why they think this is true specifically for 15-mers and not the 10 or 20-mer distributions also shown in Fig 2.
P7,L149 – do these run times correspond to running the jobs on a single core? It would be useful to mention the memory usage as well if that’s possible.
P8,L158-159 – “185 cliques were composed of a single genome sequences, therefore highlighting the high Pseudomonas strain diversity” – an alternative explanation would be that KS-I is incorrectly calling those genomes as individual cliques. If there are species (and strain) names for all genomes then that would be one way to evaluate whether these genomes are expected to be in different cliques or not.
On a related note to the above it would be useful to compare the cliques identified based on KS-I compared to ANI-b – based on Fig 2 it looks like they would be extremely similar, but I do not think that is clear from the main text.
P8,L159 – I think using estimates of Chao1 alpha-diversity to estimate the expected number of Pseudomonas clusters would only make sense if you’re considering genome cliques in a single environment (and at a particular time). I do not think the numbers of singleton and doubleton genomes in NCBI can really tell you about how many more Pseudomonas genome clusters are out there in general, if only because many Pseudomonas habitats have not been sampled.
Fig 3 – I really like the zoomable circle packing representation of the data – this seems like a great way to summarize the relationships between many genomes. It is not clear to me how novel this visualization approach is, but if the authors believe that it is novel then I would emphasize that more in the introduction and discussion.
P8 – I am not familiar with the term “clique” – maybe “cluster” would be clearer?
P8 – It’s not clear to me why changing the taxonomic label of the Pseudomonas genomes added to the database results in a higher proportion of classified reads. Is this because the CLARK algorithm tends not to collapse taxonomy to higher ranks if reads map to genomes associated with different species? If so, that is surprising to me, but I am not sure why else there would be a different in the proportion of classified reads. It would be useful to briefly explain why the authors think this is occurring. Unless I am missing something I also do not think this would make a difference for most metagenomics taxonomic classifiers like centrifuge, kraken2, and MEGAN
P11,L221-223 – I think these are great examples for how this tool could be used to clean up and perform quality control on taxonomy assignments in genome databases
P11, L229 – end – As mentioned above I do not fully follow why more reads were classified with CLARK after changing the taxonomic labels of the Pseudomonas genomes, but either way I do not think this is evidence that the taxonomic classifier is actually working better as indicated in the concluding paragraph currently. I think some sort of validation would be needed to be able to state that first organizing the genomes into cliques actually improves taxonomic classification. This is difficult to do because we almost never know the right answer in microbiome datasets. However, one potential way to do this would be to create a simulate dataset enriched for Pseudomonas (ideally with metagenome-assembled genomes from the seed datasets) and then compare the relative abundances of the taxa inferred using the 3 approaches mentioned in Fig 4 to the expected relative abundances.

Example of grammatical errors

Lastly, there are numerous grammatical errors throughout the manuscript – I have made a non-exhaustive list of example errors and possible, which hopefully will be useful for the authors.

P2,L29-30: “…datasets composed of thousand genome sequences” change to “datasets composed of thousands of genome sequences”.
P2,L31 – “kmers counts” should be “kmer counts”
P3,L64 – “for one pair of genome sequence” should be “one pair of genome sequences”
P4,L71 – “classifiers differ in term” should be “classifiers differ in terms”
P4,L72-73 – “for affiliating read to a” should be “for affiliating a read to a taxonomic rank”
P4,L81 – add “the” before “relatedness”
P5,L100 – should be “were selected” instead of “was selected”
P6,L116 – need to add either “the” or “a” in front of “common bean” depending on which is correct
P7,L143 – “Fifty percent of 15-mers is close to ANIb value of 0.95” should be “Fifty percent of 15-mers are close to an ANIb value of 0.95”.
P7,L153 – “used to investigate relatedness” should be “used to investigate the relatedness”
P10,L188 – “prohibited its used for comparing” should be “prohibit its use for comparing”
P11,L216 - “based ANIb” should be “based on ANIb”
P11,L219 – “Moreover, KI-S tool, provides…” should be re-written, perhaps as “Moreover, KI-S includes…”

https://doi.org/10.24072/pci.genomics.100006.rev12