Recommendation

A workflow for studying enigmatic non-autonomous transposable elements across bacteria

Gavin Douglas based on reviews by Sophie Abby and 1 anonymous reviewer

A recommendation of:

RAREFAN: A webservice to identify REPINs and RAYTs in bacterial genomes

Frederic Bertels, Julia von Irmer, Carsten Fortmann-Grote (2023), bioRxiv, ver.4, peer-reviewed and recommended by PCI Genomics https://doi.org/10.1101/2022.05.22.493013

Read preprint in preprint server Now published in Peer Community Journal

Abstract

RAREFAN: A webservice to identify REPINs and RAYTs in bacterial genomes

Compared to eukaryotes, repetitive sequences are rare in bacterial genomes and usually do not persist for long. Yet, there is at least one class of persistent prokaryotic mobile genetic elements: REPINs. REPINs are non-autonomous transposable elements replicated by single-copy transposases called RAYTs. REPIN-RAYT systems are mostly vertically inherited and have persisted in individual bacterial lineages for millions of years. Discovering and analyzing REPIN populations and their corresponding RAYT transposases in bacterial species can be rather laborious, hampering progress in understanding REPIN-RAYT biology and evolution. Here we present RAREFAN, a webservice that identifies REPIN populations and their corresponding RAYT transposase in a given set of bacterial genomes. We demonstrate RAREFAN’s capabilities by analyzing a set of 49 Stenotrophomonas maltophilia genomes, containing nine different REPIN-RAYT systems. We guide the reader through the process of identifying and analyzing REPIN-RAYT systems across S. maltophilia, highlighting erroneous associations between REPINs and RAYTs, and providing solutions on how to find correct associations. RAREFAN enables rapid, large-scale detection of REPINs and RAYTs, and provides insight into the fascinating world of intragenomic sequence populations in bacterial genomes. RAREFAN is available at http://rarefan.evolbio.mpg.de.

REPINs, RAYTs, transposable elements, Stenotrophomonas maltophilia, sequence populations

Submission: posted 07 June 2022
Recommendation: posted 02 February 2023, validated 07 February 2023

Cite this recommendation as:
Douglas, G. (2023) A workflow for studying enigmatic non-autonomous transposable elements across bacteria. Peer Community in Genomics, 100166. https://doi.org/10.24072/pci.genomics.100166

Recommendation

Repetitive extragenic palindromic sequences (REPs) are common repetitive elements in bacterial genomes (Gilson et al., 1984; Stern et al., 1984). In 2011, Bertels and Rainey found that REPs are overrepresented in pairs of inverted repeats, which likely form hairpin structures, and which they referred to as “REP doublets forming hairpins” (REPINs). Based on bioinformatics analyses, they argued that REPINs are likely selfish elements that evolved from REPs flanking particular transposases (Bertels and Rainey, 2011). These transposases, so-called REP-associated tyrosine transposases (RAYTs), were known to be highly associated with the REP content in a genome and to have characteristic upstream and downstream flanking REPs (Nunvar et al., 2010). The flanking REPs likely enable RAYT transposition, and their horizontal replication is physically linked to this process. In contrast, Bertels and Rainey hypothesized that REPINs are selfish elements that are highly replicated due to their similarity in arrangement to these RAYT-flanking REPs, but independently of RAYT transposition and generally with no impact on bacterial fitness (Bertels and Rainey, 2011).
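To make the idea of a REP doublet concrete, the following is a minimal Python sketch of how pairs of inverted repeats could be located in a sequence. It is an illustration only and is not RAREFAN's algorithm: the function names, the exact-match search, and the `max_gap` default are all assumptions made for this example.

```python
def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string (A, C, G, T, N only)."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(complement[base] for base in reversed(seq.upper()))


def find_positions(genome: str, motif: str) -> list[int]:
    """Return all 0-based start positions of motif in genome (overlaps allowed)."""
    positions = []
    start = genome.find(motif)
    while start != -1:
        positions.append(start)
        start = genome.find(motif, start + 1)
    return positions


def find_inverted_pairs(genome: str, rep: str, max_gap: int = 130) -> list[tuple[int, int]]:
    """Pair each forward REP copy with the nearest downstream inverted copy.

    max_gap is the largest allowed distance (bp) between the end of the
    forward copy and the start of the inverted copy. The default is an
    illustrative value, not a documented RAREFAN setting.
    """
    genome = genome.upper()
    forward = find_positions(genome, rep.upper())
    inverted = find_positions(genome, reverse_complement(rep))
    pairs = []
    for f in forward:
        f_end = f + len(rep)
        candidates = [i for i in inverted if 0 <= i - f_end <= max_gap]
        if candidates:
            pairs.append((f, min(candidates)))
    return pairs
```

A real analysis would also need to tolerate mismatches between copies, which this exact-match sketch does not attempt.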

This last point was especially contentious, as REPINs are highly conserved within species (Bertels and Rainey, 2023), which is unusual for non-beneficial bacterial DNA (Mira et al., 2001). Bertels and Rainey have since refined their argument: REPINs must provide benefits to host cells, but there are nonetheless signatures of intragenomic conflict in genomes associated with these elements (Bertels and Rainey, 2023). These signatures reflect the divergent levels of selection driving REPIN distribution: selection at the level of each DNA element and selection on each individual bacterium. I found this observation particularly interesting, as my colleague and I recently argued that these divergent levels of selection, and the interaction between them, are key to understanding bacterial pangenome diversity (Douglas and Shapiro, 2021). REPINs could be an excellent system for investigating these levels of selection across bacteria more generally.

The problem is that REPINs have not been widely characterized in bacterial genomes, partially because no bioinformatic workflow has been available for this purpose. To address this problem, Fortmann-Grote et al. (2023) developed RAREFAN, a web server for identifying RAYTs and associated REPINs in a set of input genomes. The authors showcase their tool by applying it to 49 Stenotrophomonas maltophilia genomes and providing examples of how to identify and assess RAYT-REPIN hits. The workflow requires several manual steps, but nonetheless represents a straightforward and standardized approach. Overall, this workflow should enable RAYTs and REPINs to be identified across diverse bacterial species, which will facilitate further investigation into the mechanisms driving their maintenance and spread.

References

Bertels F, Rainey PB (2023) Ancient Darwinian replicators nested within eubacterial genomes. BioEssays, 45, 2200085. https://doi.org/10.1002/bies.202200085

Bertels F, Rainey PB (2011) Within-Genome Evolution of REPINs: a New Family of Miniature Mobile DNA in Bacteria. PLOS Genetics, 7, e1002132. https://doi.org/10.1371/journal.pgen.1002132

Douglas GM, Shapiro BJ (2021) Genic Selection Within Prokaryotic Pangenomes. Genome Biology and Evolution, 13, evab234. https://doi.org/10.1093/gbe/evab234

Fortmann-Grote C, Irmer J von, Bertels F (2023) RAREFAN: A webservice to identify REPINs and RAYTs in bacterial genomes. bioRxiv, 2022.05.22.493013, ver. 4 peer-reviewed and recommended by Peer Community in Genomics. https://doi.org/10.1101/2022.05.22.493013

Gilson E, Clément JM, Brutlag D, Hofnung M (1984) A family of dispersed repetitive extragenic palindromic DNA sequences in E. coli. The EMBO Journal, 3, 1417–1421. https://doi.org/10.1002/j.1460-2075.1984.tb01986.x

Mira A, Ochman H, Moran NA (2001) Deletional bias and the evolution of bacterial genomes. Trends in Genetics, 17, 589–596. https://doi.org/10.1016/S0168-9525(01)02447-7

Nunvar J, Huckova T, Licha I (2010) Identification and characterization of repetitive extragenic palindromes (REP)-associated tyrosine transposases: implications for REP evolution and dynamics in bacterial genomes. BMC Genomics, 11, 44. https://doi.org/10.1186/1471-2164-11-44

Stern MJ, Ames GF-L, Smith NH, Clare Robinson E, Higgins CF (1984) Repetitive extragenic palindromic sequences: A major component of the bacterial genome. Cell, 37, 1015–1026. https://doi.org/10.1016/0092-8674(84)90436-7

Conflict of interest:
The recommender in charge of the evaluation of the article and the reviewers declared that they have no conflict of interest (as defined in the code of conduct of PCI) with the authors or with the content of the article. The authors declared that they comply with the PCI rule of having no financial conflicts of interest in relation to the content of the article.

Funding:
Our work was funded by the Max Planck Society

Reviews

Evaluation round #2

DOI or URL of the preprint: https://www.biorxiv.org/content/10.1101/2022.05.22.493013v3

Version of the preprint: 3

Author's Reply, 26 Jan 2023

https://doi.org/10.24072/pci.genomics.100166.ar2

Decision by Gavin Douglas, posted 29 Dec 2022, validated 03 Jan 2023

Hi Dr. Bertels and colleagues,

Both reviewers assessed your changes and agree that the manuscript is greatly improved. They have raised a few remaining points that warrant some further minor changes and clarifications.

In addition, please see my own minor comments below, which are primarily typo and phrasing fixes.

All the best,

Gavin Douglas

The license information for the source of Figure 1 (Bertels, Rainey, 2022) should be indicated somewhere in the text. There are usually requirements for how to redistribute or modify items from another work. For example, if this is under a Creative Commons license, then you should state which license version applies and give a link to that license.

The title should be changed to “RAREFAN: A webservice”… rather than “RAREFAN: a webservice…”

L67-72 I think many readers will be curious to know what the other RAYT families are associated with, if not REPINs. Since they are defined as “REP-associated” (in their name) I think this deserves at least a quick mention in a sentence or two.

Currently, at least one parameter value in Figure 2 does not match the figure legend (the distance between inverted sequences is 200bp vs 130bp). The authors should make sure that the reported values are the current defaults and that old (or otherwise conflicting) values do not differ between the figure and legend, to avoid reader confusion.

Also in Figure 2 – I recommend that the figure legend be simplified, as many of the details are already provided in the methods and are not really pertinent to interpreting the plot and the take-home messages. I will leave that to the authors’ discretion. However, I do strongly recommend that the references in this legend be removed, as these should be mentioned in the appropriate section of the methods instead (references are generally uncommon in figure legends).

I was unable to access results under run ID a2ijpkk6.

The authors should clarify the protocol for linking RAYTs to REPINs. Is it generally expected that at least one REP is within 200bp of the corresponding RAYT? Since the RAYTs act in trans as proteins, there does not seem to be any reason why this would necessarily be true, so I think a little additional explanation would be helpful.

Past tense should be used when discussing specific results. On L338, for instance, it should be “RAREFAN detected three populations when S. maltophilia Sm54 was selected as the reference strain”.
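For readers unfamiliar with this kind of proximity criterion, a minimal Python sketch of a distance-based association is given here. It only illustrates the question being asked and is not RAREFAN's implementation: the function name, the coordinate convention, and the treatment of positions inside the gene are assumptions, and the 200bp default simply mirrors the value mentioned in this comment.

```python
def reps_near_rayt(
    rayt_start: int,
    rayt_end: int,
    rep_positions: list[int],
    max_distance: int = 200,
) -> list[int]:
    """Return the REP positions lying within max_distance bp of a RAYT gene.

    The RAYT is described by its start and end coordinates; each REP is
    described by a single coordinate. All names here are hypothetical.
    """
    nearby = []
    for pos in rep_positions:
        if pos < rayt_start:
            distance = rayt_start - pos   # REP lies upstream of the gene
        elif pos > rayt_end:
            distance = pos - rayt_end     # REP lies downstream of the gene
        else:
            distance = 0                  # REP overlaps the gene interval
        if distance <= max_distance:
            nearby.append(pos)
    return nearby
```

Under such a rule, a RAYT with no REP inside the window would remain unlinked even if it acts on REPINs elsewhere in the genome, which is the concern raised above.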

The authors should use “p-value”, “P-value”, or “p value”, but not “p-Value”, which is the current usage in the text.

Minor edits

L30 – “providing” was actually grammatically correct, and so the revised change to “provide” should be undone.

L48 – “REP sequences is” should be “REP sequences are”

L53 – I suggest “not mobile anymore” be reworded to “immobile” or “no longer mobile”

L58 – “associated to” should be “associated with”

L64 – I suggest “very special” be changed to “unique”

L77 – I think the year estimate should be clarified. Presumably some RAYT/REPIN groups may have been present in a lineage for less than a million years (or at least this is possible!). So I would re-word to say that “they have been evolving in single bacterial lineages for up to millions of, or perhaps even one billion, years.”

L102 – I think “Yet,” should be removed, or perhaps replaced with “Unfortunately,”

L104 – “ins and outs” should be replaced with less colloquial language, such as “details” or “detailed features”

L105 – “the genome” should be “a genome”

L106 – “analyzed next” should be “then analyzed”

L106-107 – “If they are exclusively” should be something clearer like “If these sequences are exclusively”

L107-108 – I would put commas on each side of this sentence fragment: “and present in only one or two loci in the genome”

Figure 2 legend “the” should be re-added in front of “seed sequence”.

Implementation section of methods – python, java, flask, and shiny should all be capitalized

Regarding “Query RAYT” bullet point in implementation methods: above this is described as optional. The authors should clarify the procedure when this protein sequence is not provided, as is currently done for the Tree file option.

L180 – “(n-1)” should be “[n-1]”

L277 – “Especially” should be “This is especially true”

Figure 6 legend – here “group” is capitalized in some but not all cases. In this legend (and in the relevant section of the main text, where this also varies), the authors should consistently write “group” capitalized or lowercase in all instances.

L431-432 – “Genbank” should be “GenBank” and I think it would be clearer to say “creating a RAREFAN Galaxy workflow” rather than “integrating RAREFAN into workflows such as Galaxy”, as Galaxy is a means of making workflows available online for easy use, rather than referring to a specific workflow.

https://doi.org/10.24072/pci.genomics.100166.d2

Reviewed by Sophie Abby, 23 Dec 2022

The new version of this manuscript is much improved, with a more detailed biological background, clarifications on the methods in the main text and figures, and a new section on performance. Even though it is clear that manual curation might still be needed to assess the relevance of the results, the limits of the approach are outlined, with examples discussed and possible contingency plans. Therefore, and given that there is so far no resource available to investigate REPIN-RAYT systems, I believe that the RAREFAN tool is valuable to the microbiology community, and I could in principle support its recommendation by PCI Genomics, provided that the issues noted with the webserver are sorted out.

On main text:

Here are a few minor points/typos:

- It might be good to add the link to the webserver at the end of the abstract.

- Line 165. “Association distance REPIN-RAYT”. Please provide the default values.

- Line 187: “Among all identified REP and REPIN sequences REPIN populations can be isolated.” Something is wrong with this sentence. Is a word missing?

- Line 223: “Complete RAREFAN data used for analysis can be accessed by using the run IDs listed in Table 1.” You mentioned that the results from users are stored for 180 days. Would the IDs listed in Table 1 be stably kept over time?

- Figure 4 legend: “In an equilibrium” => “At equilibrium”?

- Line 281: “The RAREFAN webserver visualizes REPIN population size” => “enables to visualize…”?

- Line 327: “In some RAREFAN runs associations between RAYTs and REPINs are not monophyletic”. Please reformulate; associations are not the ones to be monophyletic.

- Figure 6 legend: “connect two sequence cluster.” => clusterS

- Line 180: “have been observed”

Testing the webserver: http://rarefan.evolbio.mpg.de

I could test the webserver, after the issue described in the email exchange reported below was solved.

- I could submit 8 complete genomes of Klebsiella for a run (z9hgj3ld, results accessible here: http://rarefan.evolbio.mpg.de/results?run_id=z9hgj3ld ) which found no results. I thus changed the threshold of occurrences “min_nmer_occurrence” to 5 (re-run job y8586vnk – results accessible here: http://rarefan.evolbio.mpg.de/results?run_id=y8586vnk ), and could obtain some “RAYT” occurrences.

I found it difficult to interpret the master table of results on the summary page. For instance, for run y8586vnk, the table suggested that only group 0 had REP/REPIN occurrences. However, when clicking on “Plot data” (results accessible here: http://rarefan.evolbio.mpg.de//shiny/analysis/?run_id=y8586vnk ), REPINs were also identified for group 4 (and not only group 0 as reported in the table).

- Also, I came across a minor issue in another run (sd0oyhv1 http://rarefan.evolbio.mpg.de/results?run_id=sd0oyhv1): I submitted six genomes, but one was judged unfit with the error message “GCA_000009985.1_ASM998v1_genomic.fna contains non-DNA sequences and will be removed” and was dropped from the analysis. However, this is a genomic FASTA file obtained from the NCBI/RefSeq database.
I looked into the file and found a few “N” characters, the standard letter for “any nucleotide”. Perhaps the authors could take into account that N characters may be present in some genomes, and test more thoroughly how the content of the submitted FASTA files is checked?
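As an illustration of the kind of check suggested here, the following is a minimal Python sketch that accepts the IUPAC nucleotide ambiguity codes, including N, when deciding whether a FASTA file contains DNA. It is not RAREFAN's validation code; the function names and the choice to accept every IUPAC code are assumptions made for this example.

```python
# IUPAC nucleotide codes: the four bases plus the ambiguity codes, including N.
IUPAC_DNA = set("ACGTRYSWKMBDHVN")


def non_dna_characters(fasta_text: str) -> set[str]:
    """Return the sequence characters that are not IUPAC nucleotide codes."""
    unexpected = set()
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line or line.startswith(">"):
            continue  # skip blank lines and FASTA header lines
        unexpected.update(set(line.upper()) - IUPAC_DNA)
    return unexpected


def looks_like_dna(fasta_text: str) -> bool:
    """Return True if every sequence character is an IUPAC nucleotide code."""
    return not non_dna_characters(fasta_text)
```

Because many amino-acid letters are also valid IUPAC nucleotide codes, a short protein sequence could pass this test; a stricter check might additionally require that most characters are A, C, G, or T.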

----------------------- ADDENDUM -----------------------

Email exchange with Dr. Frederic Bertels

===== Dr Abby to Dr Bertels, 20th of Dec 2022 =====

“I am writing to you to follow-up on the revised version of your article on the RAREFAN webserver.

I've been trying to test the webserver. But I could not manage to obtain results. Therefore I am unable to complete my review.

I've submitted three runs. Unfortunately, I made a mistake with the 1st run and submitted the entire proteome of an organism instead of the entire genome (run ID gvnnqk79). This is a silly mistake on my side, however you might expect such common mistakes to be made on a public server.

I don't know if it is related to this, but then I've submitted two more runs with appropriate genome files, but they have been stuck in the queue since yesterday, while the first job (gvnnqk79) seems to be stuck at the "Rarefan - started" stage.

The ID for the jobs are the following: gvnnqk79; _xwiv2us; clj3vckt

Could you please have a look, and let us know what is going on and when we will be able to test the server?”

===== Dr Bertels to Dr Abby, 21st of Dec 2022 =====

“We fixed the issue you encountered. It was caused by an error stream buffer that was filled by a very large number of error messages from the BLAST formatdb command. We had not encountered this error previously, since the formatdb command generally does not produce large error messages. The buffer is large enough to store small to medium-sized error messages, which can be read once formatdb has finished. However, the error messages produced when generating a DNA BLAST database from protein sequences completely filled the buffer; the program then paused and waited for the buffer to be emptied so that it could continue writing error messages. The emptying never happened and the program did not finish. The waiting program in turn clogged the server queue and prevented other jobs from being run.

We changed RAREFAN so that the buffer is now read continuously, which should prevent a deadlock (at least at that position in the code). We are also testing whether the submitted sequence is a DNA sequence; if it is not, an error message is thrown. We are also implementing a function that kills any job that has been running for more than 3h.

We hope that these changes will prevent the problems that you have experienced in the future.”
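The deadlock described in this reply is a general hazard when a parent process captures a child's error stream through a pipe. The Python sketch below shows one common way to avoid it, by draining the stream in a background thread while the child runs and by killing the child after a time limit. It is an analogue written for illustration and is not RAREFAN's code; the function name, the chunk size, and the timeout handling are all assumptions.

```python
import subprocess
import sys
import threading


def run_draining_stderr(command: list[str], timeout_seconds: float) -> tuple[int, int]:
    """Run command while continuously reading its stderr.

    Returns (exit code, number of stderr bytes read). The child is killed
    if it runs longer than timeout_seconds.
    """
    process = subprocess.Popen(
        command, stdout=subprocess.DEVNULL, stderr=subprocess.PIPE
    )
    byte_count = 0

    def drain() -> None:
        nonlocal byte_count
        # Reading while the child writes keeps the pipe buffer from filling,
        # so the child never blocks waiting for the parent to read.
        for chunk in iter(lambda: process.stderr.read(65536), b""):
            byte_count += len(chunk)

    reader = threading.Thread(target=drain, daemon=True)
    reader.start()
    try:
        process.wait(timeout=timeout_seconds)
    except subprocess.TimeoutExpired:
        process.kill()   # stop a job that exceeded its time limit
        process.wait()
    reader.join()
    process.stderr.close()
    return process.returncode, byte_count


if __name__ == "__main__":
    # A child that writes far more to stderr than a typical pipe buffer holds.
    noisy = [sys.executable, "-c", "import sys; sys.stderr.write('x' * 1_000_000)"]
    print(run_draining_stderr(noisy, timeout_seconds=60))
```

Waiting for the child to exit before reading its stderr would reproduce the stall described above whenever the output exceeds the pipe buffer.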

https://doi.org/10.24072/pci.genomics.100166.rev21

Reviewed by anonymous reviewer 1, 07 Dec 2022

I want to thank the authors for their thorough response to the reviewers' comments. In my opinion, the manuscript improved substantially, and I have only a few comments left.

In the methods, it might work better to first describe the Identification of the REPs, REPINs, and RAYTs, and afterwards the implementation and usage of the webserver.

The S. maltophilia example is very interesting due to the patchy presence-absence patterns of REPIN-RAYT systems. Do the authors have any idea how this patchiness evolved, given that the systems evolve vertically?

The risk of confusing CRISPRs with REPINs is mentioned in the introduction and methods. Is it recommended to run a CRISPR detection tool and remove the identified regions from the REPIN candidates? Could this be integrated into the pipeline?

As I understand, the RAYT sequence needs to be known by the user or one of the 2 known sequences needs to be chosen. However, there might not be previous knowledge on the RAYT sequence in the organism. Would it be feasible to include blasting against all known RAYT variants in the pipeline?

https://doi.org/10.24072/pci.genomics.100166.rev22

Evaluation round #1

DOI or URL of the preprint: https://www.biorxiv.org/content/10.1101/2022.05.22.493013v2

Author's Reply, 15 Nov 2022

https://doi.org/10.24072/pci.genomics.100166.ar1

Decision by Gavin Douglas, posted 21 Jul 2022

Two reviewers have now finished their reports and they have highlighted numerous points that should be addressed. The main critique appears to be that further clarification is needed, both in terms of the motivation for annotating these elements in particular and regarding various technical details of your approach. The second reviewer also highlighted several practical issues (as well as discrepancies in the results themselves) that they ran into when trying to run the tool, which I found especially concerning.

I think all of the points that were raised are constructive and should help to improve the manuscript substantially. I look forward to seeing the next version!

https://doi.org/10.24072/pci.genomics.100166.d1

Reviewed by anonymous reviewer 1, 15 Jun 2022

This preprint presents a tool to identify a particular class of mobile elements in bacterial genomes. Such a tool will make these elements easier to detect and will allow a wider audience to annotate them. However, some manual steps in the annotation are still necessary, which might limit the application of the tool.

The broad scope of annotating REPINs is not completely clear to me. The manuscript gives the impression that manual steps are still needed to annotate REPINs and to link them to RAYTs. Thus, it is currently not possible to include this annotation into pipelines for prokaryote genome annotation (such as PROKKA).

The introduction could be more explicit on the motivation of the study. Why do researchers want to identify REPINs? What kind of studies could this identification support?

The manuscript lacks an introduction to REPINs. How are they defined and what do they look like? E.g., it is mentioned that they are repetitive sequences. How long can the repeats be, how many repeats are there, are they 100% identical, are they consecutive? Although this information is present in previous papers, it is crucial for this manuscript and I suggest including it in the introduction. Also, it only becomes clear in the discussion that there are symmetric and asymmetric REPINs and that the tool only identifies the former. Such limitations should be stated in the introduction or methods.

The introduction states "The study of REPIN populations and their corresponding RAYTs can be cumbersome." The authors might want to mention the particular challenges in the introduction.

The paper focusses on bacterial REPINs. Do these elements also occur in archaea? Would the tool work for archaeal genomes? That would be interesting to mention in the introduction.

The data set is linked on the RAREFAN website (but 50 strains are mentioned, whereas there are 49 in the manuscript). However, the access is restricted. The access should be unrestricted for review.

Fig. 1
It is unclear what kind of threshold is meant in "Determine all 21bp long sequences above a certain threshold".
The figure states "vicinity (<30bp)"; however, the legend and the text describe that sequences occurring within 15bp of each other are grouped.
legend: "Hence, we grouped all sequences that occur within 15 bp of each other, anywhere in the genome." It is unclear what "anywhere in the genome" means in that context. As I understand, they are within 15bp, which is not consistent with "anywhere".

As described, identical 21bp long sequences are grouped by distance. Then the seed sequence is extracted as the most common sequence in each group (line 105). How can there be multiple different sequences within each group? As I understood, they should all be identical.
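
One possible reading of the grouping step, sketched here in Python, is that different frequent 21bp sequences (for example, windows shifted by a few bases over the same repeat) end up in one group because their occurrences lie close together, and the most frequent member then serves as the seed. This is an illustrative assumption, not RAREFAN's implementation: the function names are hypothetical, only one strand is considered, and the defaults (21, 55, 15) are simply the values quoted in the manuscript.

```python
from collections import defaultdict


def frequent_kmers(genome, k=21, min_count=55):
    """Return {kmer: [start positions]} for k-mers seen more than min_count times."""
    positions = defaultdict(list)
    for i in range(len(genome) - k + 1):
        positions[genome[i:i + k]].append(i)
    return {kmer: pos for kmer, pos in positions.items() if len(pos) > min_count}


def group_by_proximity(kmer_positions, max_gap=15):
    """Group k-mers whose occurrences start within max_gap bp (single linkage)."""
    # Flatten to (position, kmer) pairs sorted along the genome.
    hits = sorted((p, kmer) for kmer, pos in kmer_positions.items() for p in pos)
    parent = {kmer: kmer for kmer in kmer_positions}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Checking neighbouring hits is enough: any two hits within max_gap
    # are connected through the hits that lie between them.
    for (p1, k1), (p2, k2) in zip(hits, hits[1:]):
        if p2 - p1 <= max_gap:
            parent[find(k1)] = find(k2)

    groups = defaultdict(list)
    for kmer in kmer_positions:
        groups[find(kmer)].append(kmer)
    return list(groups.values())


def seed_of(group, kmer_positions):
    """Pick the most frequent k-mer of a group as its seed."""
    return max(group, key=lambda kmer: len(kmer_positions[kmer]))
```

Under this reading, a group contains several distinct sequences by construction, which would explain why a "most common sequence" has to be chosen within each group.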

It is mentioned that the genomes should "ideally" be fully sequenced and complete. Does the tool also work with contig-state draft genomes?

The results demonstrate very well how the results depend on the reference genome. The authors then suggest running the tool with multiple different reference genomes. However, this needs to be done manually, and the potentially different links between REPINs and RAYTs are currently resolved manually by the authors (Fig. 2). I got the impression that expert knowledge on REPINs is still required to resolve these multiple runs. Thus, I wonder whether this process could be automated, i.e., could the analysis be run iteratively with each genome as a reference genome and the results then merged? This would allow for a fully automated analysis given a set of strains and would greatly improve the usability of the tool.

The results also nicely demonstrate how the results depend on the chosen parameters, e.g., the frequency threshold 55. This number looks indeed quite high given the results presented in Fig. 3B. Why is such a high threshold chosen? Do false positive findings increase with lower thresholds? It would be very interesting to discuss this.

An example is described where REPIN groups can be merged (line 300). It is unclear if that is done automatically by the tool.

The authors mention that "the only known asymmetric REPIN population are E. coli REPINs." I wonder if that is due to the difficulty in the identification of asymmetric REPINs? Might they have been overlooked?

https://doi.org/10.24072/pci.genomics.100166.rev11

Reviewed by Sophie Abby, 21 Jul 2022

Review of:

“RAREFAN: a webservice to identify REPINs and RAYTs in bacterial genomes”

In this article, Fortmann-Grote and colleagues present a webservice to identify in bacterial genomes a class of repetitive elements and the associated transposase, namely the REPIN and RAYT. These mobile elements are quite intriguing as they seem to be largely vertically transmitted (i.e. the transposase seems to be rather immobile). Their function is still to be determined. Beyond detecting these elements, the webservice also provides some graphs to analyse the search results. As a test case, the authors applied the search engine to a set of 49 genomes of the bacterium Stenotrophomonas maltophilia. The results and limitations of the search are discussed, and some guidelines are provided for users to obtain the most relevant picture of the distribution of these elements in the genomes of interest.

The webservice could prove useful to microbiologists who need to analyse characteristics of their genomes, and could speed up research on these particular mobile elements. Overall, however, I found that the description of the proposed method could be greatly improved. I also report several inconsistencies observed when running the webservice on author-provided or original genome datasets, which make the webservice results difficult to interpret. I give more details on these aspects and more in the following review.

Manuscript review: major points

- The introduction lacks the necessary biological background to understand the choices made for the search engine implementation. For instance, how many copies of a given REP are usually found in genomes? Of a given REPIN? Why a default number of 55 copies to consider a REP for further search? Are REPs found in REPIN structures always that abundant in genomes? Or are there some REPINs that correspond to low-abundance REPs? How long are REPs in REPINs? How long are REPINs? Why use REPs of 21 bp when previous papers by the authors use, for instance, 16mer searches (Bertels & Rainey 2011)? How many RAYTs are usually found in a genome, and are they genetically linked to REPINs? etc… Adding such a paragraph could help the readers to understand the method proposed for REPIN+RAYT detection.

- I know it is “only” a matter of nomenclature but could the authors also mention other names attributed to RAYT? From the Ton-Hoang 2012 paper for instance (TnpAREP if I’m correct)? That could help researchers that are unfamiliar with the literature and the field of repetitive elements to understand exactly what RAREFAN is about.

- As described in Figure 1 and in the main text, I could not properly understand how the REPIN search functions. Please clarify considerably both the figure and the text.

In particular:

1) On Fig. 1:

--- A step => add perhaps optional input files (for instance a genome phylogeny if I got it right?)

--- B step => “Identifying REP sequence groups” this title would be more explanatory (if I’m correct?). Otherwise please clarify what are “sequence groups”.

Step 1) “Determine 21bp long sequences above a certain threshold” of what (number of occurrences, right?)? etc…

Step 2) It is unclear the difference between the groups. Sequences are grouped by vicinity on the reference genome sequence? based on sequence similarity? Please clarify the text.

--- B step => performed on a reference “genome” add “genome”?

--- B step overall schema could probably be improved to increase clarity.

--- C step => “of each for each” typo?

--- C step => step 2) Are REPINs identified from pairs of REPs within the same group? Or not necessarily? Please clarify.

--- The parameters that can be changed by the user could be mentioned on Fig. 1.

--- Add at which step is the genome phylogeny computed (and with what). Is this an optional or mandatory step? etc…

2) In main text:

--- Line 70, it is mentioned that MCL is used to cluster REPIN sequences. When is this used in RAREFAN? It does not seem to appear on Figure 1.

--- Line 104 “All sequences occurring… at least once within 15bp of each other” => I don’t understand, could you please clarify? Where does this appear on Fig. 1? Is it rather the 30bp vicinity of step B2?

--- Lines 113-114: it is unclear to me whether Group 2 or Group 3 RAYT reference sequences would be used, or both. Please clarify. Is that the user choice? Can both be used if no a priori knowledge is held on which type to find in the genomes to analyse? Also, could you remind here which tblastn parameter is used (cf. line 88)?

--- Line 117: please add more explanations on how REPIN populations and RAYT are linked.

--- Line 120: please add that it is a user-provided genome phylogeny or a computed one (it was unclear to me, I only got it when going through the webservice pages).

- The authors state that the described method to detect REP sequences has already been described elsewhere (in articles by the authors themselves), but that the present implementation is “slightly improved”. Could the authors clarify what is different from the previous methodology, and how this is an improvement? How do the results compare to previous genome analyses performed in some of the cited papers (for instance 1st paragraph of results?).

- Line 171: the authors “suggest to perform multiple RAREFAN runs with different reference strains.” Could there be a relevant way to automatically merge the results from different runs?

- In relation to above comment: Please state in the methods which genomes were used as a reference for the five different runs mentioned in Line 239. How did the authors choose these 5 genomes (sometimes, four are mentioned?), and could there be some hints on how to choose them (ANI-based? based on the genome phylogeny...)?

- Line 180-181: what happens if the seed sequence frequency threshold is lowered for REP search? Would that result in many false positives for REPINs? Or would the obtained candidate REPs naturally be expunged as not part of REPINs? And in terms of computation, would that be considerably slower?

- On the same note, could the authors give a hint about the computational time required and how it scales with the size of the genome dataset to analyse?

- Line 218-225: Interesting observations about the presence of RAYT and REPIN population sizes, but please provide numbers and statistics for the statements in this paragraph.

- Line 244-245: “A detailed analysis of the extragenic space of “wrongly” associated RAYT genes showed that these genes are flanked by seed sequences from two different REPIN populations”.

So how is this handled by RAREFAN? How exactly is it decided which REPIN population is assigned to a RAYT? On Line 117 it is simply written that "The presence of RAYTs in the vicinity of a particular REPIN can be used to establish the association between the RAYT gene and a REPIN group". Would it be possible to assign to a RAYT the REPIN population that is most often found next to it? Could ambiguities be flagged in the log or output files to help guide the user?

- Line 254-256: can the user change the 130bp parameter between a RAYT and REPIN to consider them associated? Please clarify in the text.

- Lines 272-273 and 280: Couldn't the problem of merged seed groups or split seed groups be sorted out automatically by using a sequence clustering and "dereplication" approach to identify the seed sequences to be used for the search (or is this already the case and I didn’t get it)? More generally, what improvements could the authors envision for their tool? Could this be discussed in the Discussion section?
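
The suggestion made above for Lines 244-245 and 254-256 (assign to a RAYT the REPIN population most often found next to it, and flag ambiguities) could be sketched roughly as below. This is a hypothetical illustration, not a description of how RAREFAN works: the function name, the input format (pairs of REPIN group ID and distance in bp, pooled over genomes) and the choice to flag any RAYT flanked by more than one group are all assumptions; 130 bp is the distance quoted in the manuscript.

```python
from collections import Counter


def assign_repin_group(flanking_hits, max_distance=130):
    """Assign a RAYT the REPIN group seen most often within max_distance bp.

    flanking_hits: list of (repin_group_id, distance_bp) pairs for one RAYT,
    pooled over genomes. Returns (group_id or None, is_ambiguous).
    """
    counts = Counter(g for g, d in flanking_hits if d <= max_distance)
    if not counts:
        # No REPIN close enough: leave the RAYT unassigned.
        return None, False
    best_group, _ = counts.most_common(1)[0]
    # Flag the RAYT whenever more than one group flanks it,
    # so the user can inspect the case in the log.
    ambiguous = len(counts) > 1
    return best_group, ambiguous
```

For a RAYT flanked twice by group 0 and once by group 1, this returns group 0 together with an ambiguity flag that could be written to the log or output files.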

Manuscript review: minor points

- “Stenotrophomonas maltophilia” is misspelled line 15 in the keyword list on page 1.

- Line 18 in the abstract: saying that “mobile genetic elements are rare in bacterial genomes” may be a bit strong. Could this refer more specifically to repetitive elements only, if the authors agree?

- Line 21: instead of “are vertically inherited”, could the authors consider changing to “seem mostly vertically inherited”? To nuance a bit, as these elements have not been thoroughly studied in many genomes so far?

- Line 92: could the authors specify on which servers RAREFAN is run? Is it stably maintained?

- Line 121, you define what a “master sequence” is. Could this concept also appear on Fig. 1 for consistency's sake?

- Line 212, “P. chlororaphis” please spell out the entire genus name upon first appearance.

Test of the webservice http://rarefan.evolbio.mpg.de/

Overall, I found the results difficult to understand. I also found some of the output sentences on the main Results page and some error/warning messages confusing or inconsistent when compared with the output files. I also had server connection issues when accessing the Plot data section. Whether this was a temporary issue with the server or something recurrent, I could not say. Here are the details:

- On the main Results page, the REPINs section shows the number of REPINs detected in the reference genome. Would it be possible to display the number of REPIN groups and how they are distributed among genomes, in the form of a simple table for instance?

- I ran RAREFAN using the “Dokdonia” test dataset provided on the website (from Zenodo) with default parameters (including the reference genome chosen by default, dsw-1) and the sequence data contained in the “in” folder. Warnings or errors were raised:

“Status: complete with warnings
There have been warning or errors during the postprocessing of your results. Please inspect the output data and logfile (out/rarefan.log) carefully.”

Is this related to the first line of the rarefan.log file reading: “Wrong letter in DNA sequence: |”? I obtained this error with multiple input datasets, is this a bug?

- Using the same “Dokdonia” test dataset, no RAYTs were identified, but several REPIN groups were. However, I don’t understand why, in the Plot data, the histogram of the REPIN population size (“REPINs” tab in the analysis toolbox) is shown only for REPIN population 0, but is not shown along the trees from REPIN group 1 onwards. How many REPIN populations were proposed? Where is this information provided (see also my comment above)?

- When using a dataset I chose (5 Kingella kingae genomes, run with different reference genomes: run IDs 92cx136, b2ecb95l and _v6qq4vm), I had the following message on the Results page:
“REPINs
. There was a problem with the REP(IN) analysis output data. Please check your results carefully.”

When is this message provided, and could it be more explicit? Is it linked to the following sentence?

“We detected 0 REPINs in the reference genome.”

- I got the following message on the Kingella dataset:
“Seed sequences
There are 0 21bp long sequences in the reference genome that occur more frequently than 55 times.”

I don’t understand this, as several REPINs were proposed subsequently, including in the reference genome. Aren’t the REPIN searches based on REPs found in the reference genome, as suggested by Fig. 1? Moreover, there were >70 sequences listed as overrepresented in the file “.overrep” (for example in runs “_v6qq4vm” and “b2ecb95l”).

- I could not find the output file called “prox.stats” in the downloaded folders of either run (Dokdonia and Kingella). However, it was available in the Dokdonia “out” folder provided on Zenodo.

- I don’t understand why certain maxREPIN_[0-5] files are empty. Could the reason be added to the output file description? The same goes for the presAbs_[0-5].txt files.

- When clicking the “Plot data” link, I repeatedly had issues with accessing these. It said: “Disconnected from the server. Reload ”

- Just an observation: in “results.txt”, it seems that the names of the genome files of the form “GCF_11612705” have been parsed, resulting in 5 columns, whereas there are 4 columns in the same output file for the Dokdonia dataset.

- Could the run number be reported in the rarefan.log file? It would be convenient to the user to access previous runs’ results stored on the server. For how long are these runs’ results stored?

- When downloading the Results data as an archive, would it be possible to add to the archive a README file describing the output files? It could for example be directly taken from the text of the http://rarefan.evolbio.mpg.de/manual page, section “File output”.
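
Regarding the “Wrong letter in DNA sequence: |” message reported above, a check that names the offending line and character would make such warnings more explicit. The following Python sketch shows one way to do this; it is an assumption-laden illustration rather than RAREFAN's validation code. The function name is hypothetical, and the accepted alphabet (A, C, G, T, N in either case) is a simplification that excludes the other IUPAC ambiguity codes.

```python
VALID_DNA = set("ACGTNacgtn")


def invalid_fasta_characters(lines, alphabet=VALID_DNA):
    """Return (line_number, character) for each non-DNA character in sequence lines.

    Header lines (starting with '>') and blank lines are skipped.
    """
    problems = []
    for line_number, line in enumerate(lines, start=1):
        text = line.strip()
        if not text or text.startswith(">"):
            continue
        for ch in text:
            if ch not in alphabet:
                problems.append((line_number, ch))
    return problems
```

A protein FASTA file submitted by mistake would produce a long list of problems under this check, so the job could be rejected before any BLAST database is built.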

https://doi.org/10.24072/pci.genomics.100166.rev12