Recommendation

A systematic approach to the study of GC-biased gene conversion in mammals

Carina Farah Mugal based on reviews by Fanny Pouyet , David Castellano and 1 anonymous reviewer

A recommendation of:

Fine-scale quantification of GC-biased gene conversion intensity in mammals

Nicolas Galtier (2021), bioRxiv, 2021.05.05.442789, ver. 5 peer-reviewed and recommended by Peer Community in Genomics https://doi.org/10.1101/2021.05.05.442789

Read preprint in preprint server Now published in Peer Community Journal

Data used for results

Abstract

Fine-scale quantification of GC-biased gene conversion intensity in mammals

GC-biased gene conversion (gBGC) is a molecular evolutionary force that favours GC over AT alleles irrespective of their fitness effect. Quantifying the variation in time and across genomes of its intensity is key to properly interpret patterns of molecular evolution. In particular, the existing literature is unclear regarding the relationship between gBGC strength and species effective population size, Ne. Here we analysed the nucleotide substitution pattern in coding sequences of closely related species of mammals, thus accessing a high resolution map of the intensity of gBGC. Our maximum likelihood approach shows that gBGC is pervasive, highly variable among species and genes, and of strength positively correlated with Ne in mammals. We estimate that gBGC explains up to 60% of the total amount of synonymous AT→GC substitutions. We show that the fine-scale analysis of gBGC-induced nucleotide substitutions has the potential to inform on various aspects of molecular evolution, such as the distribution of fitness effects of mutations and the dynamics of recombination hotspots.

GC-content; effective population size; recombination hotspots; population genomics

Submission: posted 25 May 2021
Recommendation: posted 01 October 2021, validated 07 October 2021

Cite this recommendation as:
Mugal, C. (2021) A systematic approach to the study of GC-biased gene conversion in mammals. Peer Community in Genomics, 100012. https://doi.org/10.24072/pci.genomics.100012

Recommendation

The role of GC-biased gene conversion (gBGC) in molecular evolution has interested scientists for the last two decades since its discovery in 1999 (Eyre-Walker 1999; Galtier et al. 2001). gBGC is a process that is associated with meiotic recombination, and is characterized by a transmission distortion in favor of G and C over A and T alleles at GC/AT heterozygous sites that occur in the vicinity of recombination-inducing double-strand breaks (Duret and Galtier 2009; Mugal et al. 2015). This transmission distortion results in a fixation bias of G and C alleles, equivalent to directional selection for G and C (Nagylaki 1983). The fixation bias subsequently leads to a correlation between recombination rate and GC content across the genome, which has served as indirect evidence for the prevalence of gBGC in many organisms. The fixation bias also produces shifts in the allele frequency spectrum (AFS) towards higher frequencies of G and C alleles.

These molecular signatures of gBGC provide a means to quantify the strength of gBGC and study its variation among species and across the genome. Following this idea, Lartillot (2013) and Capra et al. (2013) first developed phylogenetic methodology to quantify gBGC based on substitutions, and De Maio et al. (2013) incorporated information on polymorphism into a phylogenetic setting. Complementary to the phylogenetic methods, Glémin et al. (2015) later developed a method that draws information solely from polymorphism data and the shape of the AFS. Application of these methods to primates (Capra et al. 2013; De Maio et al. 2013; Glémin et al. 2015) and mammals (Lartillot 2013) supported the notion that variation in the strength of gBGC across the genome reflects the dynamics of the recombination landscape, while variation among species correlates with proxies of the effective population size. However, application of the polymorphism-based method by Glémin et al. (2015) to distantly related Metazoa did not confirm the correlation with effective population size (Galtier et al. 2018).

Here, Galtier (2021) introduces a novel phylogenetic approach applicable to the study of closely related species. Specifically, Galtier introduces a statistical framework that enables the systematic study of variation in the strength of gBGC among species and among genes. In addition, Galtier assesses fine-scale variation of gBGC across the genome by means of spatial autocorrelation analysis. This puts Galtier in a position to study variation in the strength of gBGC at three different scales, i) among species, ii) among genes, and iii) within genes. Galtier applies his method to four families of mammals, Hominidae, Cercopithecidae, Bovidae, and Muridae and provides a thorough discussion of his findings and methodology.
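The spatial autocorrelation analysis mentioned above relies on Moran's I. The following is a minimal sketch of that statistic for substitutions along a single gene, not the author's implementation. It assumes binary weights (two substitutions count as neighbours when they lie within `max_dist` base pairs of each other) and an illustrative coding of substitution type (1 for AT→GC, 0 for GC→AT). The function name, the coding, and the default window are hypothetical choices made for illustration.

```python
def morans_i(positions, values, max_dist=400):
    """Moran's I with binary distance weights (illustrative sketch).

    positions: coordinates (bp) of substitutions along a gene
    values:    one number per substitution, e.g. 1 = AT->GC, 0 = GC->AT
               (this coding is an assumption made for illustration)
    max_dist:  two substitutions are neighbours if at most this far apart
    """
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    denom = sum(d * d for d in dev)
    if denom == 0:
        return float("nan")  # all values identical: statistic undefined
    num = 0.0
    w_total = 0
    for i in range(n):
        for j in range(n):
            if i != j and abs(positions[i] - positions[j]) <= max_dist:
                num += dev[i] * dev[j]
                w_total += 1
    if w_total == 0:
        return float("nan")  # no neighbouring pairs within max_dist
    return (n / w_total) * (num / denom)


# Two tight clusters of same-type substitutions: positive autocorrelation.
clustered = morans_i([0, 100, 200, 5000, 5100, 5200], [1, 1, 1, 0, 0, 0])
# Strictly alternating types among adjacent sites: negative autocorrelation.
alternating = morans_i([0, 100, 200, 300], [1, 0, 1, 0], max_dist=100)
```

Under this coding, positive values indicate that substitutions of the same type tend to occur close to each other, which is the kind of clustering that the within-gene analysis is designed to detect.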

Galtier found that the strength of gBGC correlates with proxies of the effective population size (Ne), but that the slope of the relationship differs among the four families of mammals. Because the population-scaled strength of gBGC is B = 4Neb, where b is the conversion bias, this finding suggests that b could vary among mammalian species. Variation in b could result either from differences in the strength of the transmission distortion (Galtier et al. 2018) or from evolutionary changes in the rate of recombination (Boman et al. 2021). Alternatively, Galtier suggests that systematic variation in proxies of Ne could also lead to similar observations. Finally, the present study reports intriguing inter-species differences between the extent of variation in the strength of gBGC among and within genes, which are interpreted in consideration of the recombination dynamics in mammals.
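The relationship B = 4Neb can be made concrete with a short sketch. It assumes the classical result that a fixation bias of population-scaled strength B changes the substitution rate, relative to a neutral allele, by the factor B / (1 - exp(-B)), in line with the equivalence to directional selection noted above. The function names and the numerical values of Ne and b are hypothetical and chosen only for illustration; they are not estimates from the preprint.

```python
import math


def scaled_gbgc(ne, b):
    """Population-scaled gBGC strength, B = 4 * Ne * b."""
    return 4.0 * ne * b


def relative_substitution_rate(B):
    """Substitution rate relative to neutrality for a scaled bias B.

    Use B > 0 for AT->GC changes (favoured by gBGC) and -B for
    GC->AT changes (disfavoured). The limit for B -> 0 is 1.
    """
    if abs(B) < 1e-8:
        return 1.0
    return B / (1.0 - math.exp(-B))


# Hypothetical values: Ne = 10,000 and b = 2.5e-5 give B = 1.
B = scaled_gbgc(10_000, 2.5e-5)
ws_rate = relative_substitution_rate(B)    # AT->GC, elevated above 1
sw_rate = relative_substitution_rate(-B)   # GC->AT, reduced below 1
```

With a constant b, B scales linearly with Ne, so a difference in the slope of the B versus Ne relationship among families is what one would expect if b itself differs among them, or if the proxies of Ne behave differently across families.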

References

Boman J, Mugal CF, Backström N (2021) The Effects of GC-Biased Gene Conversion on Patterns of Genetic Diversity among and across Butterfly Genomes. Genome Biology and Evolution, 13. https://doi.org/10.1093/gbe/evab064

Capra JA, Hubisz MJ, Kostka D, Pollard KS, Siepel A (2013) A Model-Based Analysis of GC-Biased Gene Conversion in the Human and Chimpanzee Genomes. PLOS Genetics, 9, e1003684. https://doi.org/10.1371/journal.pgen.1003684

De Maio N, Schlötterer C, Kosiol C (2013) Linking Great Apes Genome Evolution across Time Scales Using Polymorphism-Aware Phylogenetic Models. Molecular Biology and Evolution, 30, 2249–2262. https://doi.org/10.1093/molbev/mst131

Duret L, Galtier N (2009) Biased Gene Conversion and the Evolution of Mammalian Genomic Landscapes. Annual Review of Genomics and Human Genetics, 10, 285–311. https://doi.org/10.1146/annurev-genom-082908-150001

Eyre-Walker A (1999) Evidence of Selection on Silent Site Base Composition in Mammals: Potential Implications for the Evolution of Isochores and Junk DNA. Genetics, 152, 675–683. https://doi.org/10.1093/genetics/152.2.675

Galtier N (2021) Fine-scale quantification of GC-biased gene conversion intensity in mammals. bioRxiv, 2021.05.05.442789, ver. 5 peer-reviewed and recommended by Peer Community in Genomics. https://doi.org/10.1101/2021.05.05.442789

Galtier N, Piganeau G, Mouchiroud D, Duret L (2001) GC-Content Evolution in Mammalian Genomes: The Biased Gene Conversion Hypothesis. Genetics, 159, 907–911. https://doi.org/10.1093/genetics/159.2.907

Galtier N, Roux C, Rousselle M, Romiguier J, Figuet E, Glémin S, Bierne N, Duret L (2018) Codon Usage Bias in Animals: Disentangling the Effects of Natural Selection, Effective Population Size, and GC-Biased Gene Conversion. Molecular Biology and Evolution, 35, 1092–1103. https://doi.org/10.1093/molbev/msy015

Glémin S, Arndt PF, Messer PW, Petrov D, Galtier N, Duret L (2015) Quantification of GC-biased gene conversion in the human genome. Genome Research, 25, 1215–1228. https://doi.org/10.1101/gr.185488.114

Lartillot N (2013) Phylogenetic Patterns of GC-Biased Gene Conversion in Placental Mammals and the Evolutionary Dynamics of Recombination Landscapes. Molecular Biology and Evolution, 30, 489–502. https://doi.org/10.1093/molbev/mss239

Mugal CF, Weber CC, Ellegren H (2015) GC-biased gene conversion links the recombination landscape and demography to genomic base composition. BioEssays, 37, 1317–1326. https://doi.org/10.1002/bies.201500058

Nagylaki T (1983) Evolution of a finite population under gene conversion. Proceedings of the National Academy of Sciences, 80, 6278–6281. https://doi.org/10.1073/pnas.80.20.6278

Conflict of interest:
The recommender in charge of the evaluation of the article and the reviewers declared that they have no conflict of interest (as defined in the code of conduct of PCI) with the authors or with the content of the article. The authors declared that they comply with the PCI rule of having no financial conflicts of interest in relation to the content of the article.

Reviews

Evaluation round #2

DOI or URL of the preprint: https://www.biorxiv.org/content/10.1101/2021.05.05.442789v4

Version of the preprint: 4

Author's Reply, 20 Sep 2021

Thanks for a second round of reviewing and the positive outcome. There were three more comments:

- Line 88-90: "A total of 1,104,917 third codon position synonymous substitutions and 514,552 first or second codon position non-synonymous substitutions were called." Why there are more third codon positions than first+second codon positions?

-> Here we are talking about substitutions not positions. There indeed are more synonymous than non-synonymous substitutions in this (and most) data set(s). I did not modify the ms based on this suggestion.

- I find that ABC might be a more formal (but also computationally intensive) way of assessing the amount of clustering needed to replicate Moran's I statistic. However, given the simplicity of the simulations I believe the current approach might be already yielding accurate estimates.

-> Yes, trying to fit a formal model to the clustering pattern sounds like an interesting perspective, in part (although not very explicitly) covered by the discussion. Thanks for this. I did not modify the ms based on this suggestion.

- Line 283-285: "That said, not only Ne influences the variation of π: the mutation rate also matters. Among-species differences in per generation mutation rate, if any, should be taken into account for a better assessment of the b vs. Ne relationship". Maybe the author wants to back up this argument using and citing some of the results reported here: https://doi.org/10.1093/gbe/evab150

-> Good suggestion; this interesting paper is now cited.

Best regards,

Nicolas Galtier

https://doi.org/10.24072/pci.genomics.100088.ar2

Decision by Carina Farah Mugal, posted 15 Sep 2021

Dear Nicolas Galtier,

I am pleased to inform you that all three reviewers and I found that your revisions address all the earlier concerns and that your revised manuscript is in principle ready for recommendation. Only one reviewer has some very minor suggestions, which I think will be straightforward to address.

Best wishes,

Carina Farah Mugal

https://doi.org/10.24072/pci.genomics.100088.d2

Reviewed by Fanny Pouyet , 08 Sep 2021

I have reviewed the paper entitled "Fine-scale quantification of GC-biased gene conversion intensity in mammals." by Nicolas Galtier.

The revised version of the manuscript encompasses all of the remarks I had and even more. I especially enjoyed the rewriting of the methods and the explanation of the weighted AIC which help the reader to fully understand the study. The discussion has been extended regarding very interesting suggestions from other reviewers.

I have no further comments.
Fanny Pouyet
It is my standard policy to sign my reviews (see round 1 for motivation)

https://doi.org/10.24072/pci.genomics.100088.rev21

Reviewed by anonymous reviewer 1, 24 Aug 2021

The authors did a good job at improving the manuscript. My comments have been addressed and I think the manuscript can be accepted in the current version.
Specifically, the authors have:
- Clarified the values for certain parameters used and added further description on the simulation methods sections.
- Added new analyses for life-history traits.
- Improved the discussion.
- Addressed editing suggestions in figures and text.
- Added a supplementary table which summarised the results and model parametrisation (as suggested by another reviewer).

https://doi.org/10.24072/pci.genomics.100088.rev22

Reviewed by David Castellano, 01 Sep 2021

In this second version of the manuscript, Galtier has addressed all my questions and comments. The manuscript was already good but now it clarifies some parts that were a bit obscure in the original version. I like the discussion, it proposes multiple future lines of investigation. I just have three minor comments:

> Line 88-90: "A total of 1,104,917 third codon position synonymous substitutions and 514,552 first or second codon position non-synonymous substitutions were called." Why there are more third codon positions than first+second codon positions?

> I find that ABC might be a more formal (but also computationally intensive) way of assessing the amount of clustering needed to replicate Moran's I statistic. However, given the simplicity of the simulations I believe the current approach might be already yielding accurate estimates.

> Line 283-285: "That said, not only Ne influences the variation of π: the mutation rate also matters. Among-species differences in per generation mutation rate, if any, should be taken into account for a better assessment of the b vs. Ne relationship". Maybe the author wants to back up this argument using and citing some of the results reported here: https://doi.org/10.1093/gbe/evab150

https://doi.org/10.24072/pci.genomics.100088.rev23

Evaluation round #1

DOI or URL of the preprint: https://www.biorxiv.org/content/10.1101/2021.05.05.442789v3

Version of the preprint: 3

Author's Reply, 23 Jul 2021

Download author's reply https://doi.org/10.24072/pci.genomics.100088.ar1

Decision by Carina Farah Mugal, posted 11 Jul 2021

Dear Nicolas Galtier,

Three reviewers have now read your manuscript "Fine-scale quantification of GC-biased gene conversion intensity in mammals". All three reviewers and I find the manuscript well-written, and results and method-contribution relevant and interesting. Nevertheless, all three reviewers also provide valuable suggestions for improvement, which I think will help to improve the quality and clarity of the study. Briefly,

(1) All three reviewers ask for more clarity of Method and Figure description. You will find specific comments and suggestions in the respective review reports.

(2) Reviewer #3 suggests an alternative hypothesis to explain the weak clustering of WS substitutions within Hominidae genes, which I find very interesting. This reviewer also suggests a hypothesis to explain observations of the RSD analysis, which I think could be worth exploring.

(3) Personally, I am wondering how the contribution of ancestral and lineage-specific polymorphisms could bias the estimation of substitution rates in closely related species (see e.g. doi: 10.1093/molbev/msz203)?

Besides, the author might find this reference on DSB evolution in mice useful in relation to their own observations in mice (doi: 10.1093/molbev/mst154).

I look forward to receiving a revised version of the manuscript.

Best regards,

Carina Farah Mugal

https://doi.org/10.24072/pci.genomics.100088.d1

Reviewed by Fanny Pouyet , 14 Jun 2021

In the manuscript entitled "Fine-scale quantification of GC-biased gene conversion intensity in mammals." by Nicolas Galtier (DOI https://www.biorxiv.org/content/10.1101/2021.05.05.442789v3), the author investigates how to measure gBGC strength in 4 clades of mammals (Hominidae, Cercopithecidae, Bovidae and Muridae).

The study is well designed, as it compares observed statistics such as Moran's I or substitution density to simulations with a nested, increasing number of parameters. The study uses a maximum-likelihood approach to reject the simplest models, as it should. I have no comments on the experimental design. The analysis and interpretation of results are clear as well. The weaknesses of this manuscript lie in the text and figure legends, which do not always help the reader to properly understand what was done and why. The availability of the scripts used to make the figures helped me in doing the review.

Here are the specific comments:

1. Fig1: Please make explicit in the legend what the green branches are (I assumed they are the studied branches but I'm unsure).

2. Fig2: I did not understand why you need branches with at least 100 genes, each of them having at least 3 substitutions of each type. I would have expected this criterion to emphasize any signal of clustering. If you want to test whether there is a cluster, shouldn't you keep branches where there is a minimal number of substitutions to have a signal (here, 300), regardless of whether they are well dispersed across genes?

3. In several figures you wrote there is 1 dot per species. I thought you were also looking at internal branches. Do you calculate Moran's I solely at tips or in all the branches suitable for the study? Specifically, sup fig 1: you said 1 dot = 1 species but there are 8 dots in Bovidae and only 7 species. Moreover, I see only 7 blue dots and 8 red ones. Is it possible that 2 blue dots are overlapping on the right? You could add a darker perimeter for the circles so that we can see whether 2 overlap or not.

4. Supp Fig 1: Why are there differences between the blue and red lines in the simulations? Given the equations, I thought we would estimate the exact same value of B for both SW and WS substitutions.

5. Table S1: Please make explicit what the number of genes_cleaned is. Given the explanation in the materials and methods (X->Y substitutions: all the descendants must be X on one side of the tree and Y on the rest), I don't get why we don't have the same number of genes within a clade.

6. In general, many sentences are long, which does not help the reader. Please rewrite at least the sentence on lines 349-352. It is too long and the idea is complex (I had to re-read it 6 times before getting to the point).

7. Do you have an idea why there is an outlier in Bovidae at 3.86? If I get it right it is the capra-ovis ovarie branch (see supp table). It has a huge sd (~6). Do you think it is because there is no information, few substitutions, and so it is difficult to estimate a signal, or do you think it is a biological signal, like a huge hotspot of gBGC intensity along that branch? I would have been interested to read about it in the discussion (if it is biological, is it related to anything known about their evolutionary history?).

8. Models are complex and while they are well explained in the methods, I think you could help the reader by making a table summarizing the models. I had to dig to understand why f or z in models’ names. In the same topic, could you make a table summarizing how many branches are rejected per model ?

9. Which branch is not rejected using M3sh compared to M3h ? Is it in humans where there is small gBGC or is it somewhere else ? Do you have any comments to make on this branch ?

10. What do you mean by averaging the Akaike Information Criterion of each model in practice? I couldn’t find it in the scripts and I am interested in understanding what you did there.

10b. In equation 16 and 17. What is « k »: the genes or the AIC values ?

11. Fig4 : I think you represent the correlation of B and dN/dS using a log transformed scale (but the y axis and x axis values are still the values of B and dN/dS). The title is misleading.

12. Fig5 : The second sentence in the legend is unclear : « in for ».
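Comments 10 and 10b concern the averaging over AIC values. One common way to do this, which may differ from the manuscript's equations 16 and 17, is to convert the AIC of each candidate model into an Akaike weight and to average parameter estimates across models with those weights. The sketch below illustrates that generic technique under this assumption; the function names and the numbers in the usage lines are hypothetical.

```python
import math


def akaike_weights(aic_values):
    """Convert a list of AIC values (one per model) into Akaike weights."""
    best = min(aic_values)
    # Differences to the best model keep the exponentials numerically stable.
    raw = [math.exp(-0.5 * (a - best)) for a in aic_values]
    total = sum(raw)
    return [r / total for r in raw]


def model_averaged(estimates, aic_values):
    """Weighted average of per-model estimates using Akaike weights."""
    weights = akaike_weights(aic_values)
    return sum(w * e for w, e in zip(weights, estimates))


# Hypothetical example: three models give three estimates of B for a branch.
weights = akaike_weights([100.0, 102.0, 110.0])
averaged_B = model_averaged([0.8, 1.2, 2.0], [100.0, 102.0, 110.0])
```

In such a scheme the index of the sum runs over models, and a model with a much higher AIC than the best one contributes almost nothing to the averaged estimate.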

I sign my review to increase the transparency of that process.

https://doi.org/10.24072/pci.genomics.100088.rev11

Reviewed by anonymous reviewer 1, 23 Jun 2021

In this work, N. Galtier estimated the strength of gBGC and investigated its relationship with Ne across 4 different families of mammals. To do this, he analysed nucleotide substitution patterns in coding sequences of 40 mammalian lineages using a maximum likelihood approach. The results of the study suggest that gBGC is prevalent in these mammalian families, estimating that a large proportion of WS synonymous substitutions can be attributed to this process and that its strength varies across lineages and genes depending on Ne and the dynamics of recombination hotspots.

This work joins a large body of literature that demonstrates that gBGC is a major force shaping patterns of molecular evolution. The article is well written and I enjoyed reading it. The potential limitations that came to mind while I reviewed the study were properly discussed, such as the fact that these results depend on assuming a constant mutational process across species. So, I do not have major comments for improving this manuscript.

In terms of novelty of the work, other studies have tried to estimate B across mammalian lineages. However, most studies have estimated B from site frequency spectra. A few studies, which are cited in this manuscript, have already tried to estimate B using substitution patterns. Specifically, Lartillot (2012) proposed an integrated Bayesian model for reconstructing the evolutionary history of gBGC, and for estimating its correlation with life-history and karyotypic traits. Nonetheless, this maximum-likelihood framework is an alternative model that confirmed many previous studies and seems very valuable for the research field and community.

Minor comments:

line 102: I suggest editing: "As far as SW substitutions were concerned, " to "The centered Moran’s I for WS substitutions "

line 110: Why is there a discrepancy between the bp scale used by the author when calculating Moran's I (400 bp) and the one used in the simulation (40 and 500 bp?). If the aim is to assess the amount of clustering needed to explain the observed values, real and simulated data should have the same bp scale.

Figure 4: The sample size was here too small to investigate the within-family relationships. To further investigate the relationship of B and Ne, the author could show whether there is a correlation between B and Ne-related life history traits and assess this within-family relationship. This would help strengthen the argument, given that even this relationship (putative correlation between B and Ne), as judged by the correlation between B and dN/dS, is only weakly convincing as there are few data points within each family. Moreover, the family with the largest number of lineages is the one that shows no significant relationship.

Supplementary Figure 1: It was difficult to understand this plot.
It is not clear whether the numbers in red are shared between Bovidae and Muridae or whether they are missing for the Bovidae panel (same for the two upper panels).
The last sentence of the legend: "accounting for substitutions that were lost because appearing within introns or flanking regions." It is not clear in the methods how these were accounted for.

Section 5.4: Needs clarification. Contrary to the rest of the manuscript, this part was not clear.

line 344: It is unclear what the author means by "randomly sample the location of the first substitution". What substitution? The first substitution in a 4-species alignment? For this, was the location in the hypothetical branch randomly sampled? It is also unclear what the author means by "across genes and exons;": was this done once for genes (including introns) and once for exons? (I assume this has to do with my previous question for Supp. Fig. 1, so some clarification here is needed.)

line 360: "Two parameters of the simulation procedure were varied among conditions, namely the per third codon position density of substitutions, and the probability pclust for two successive substitutions". It is not clear in the text what the values for these parameters were across the different simulations.

line 399: Empirical estimates of mutation rate in humans were used. It is possible that mutation rates vary between the investigated taxon families. Could a real difference in mutation rates between lineages lead to the observed patterns attributed to differences in B?

Figures 4 and 5 could be placed in the appropriate sections.
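To make the scale question raised for line 110 concrete, here is a minimal sketch of the classical Moran's I statistic, assuming binary substitution indicators and a distance-threshold weight (pairs count as neighbours when they lie within `max_dist` bp). The function name, arguments, and the weighting scheme are illustrative assumptions; the manuscript uses a centered variant whose exact definition is not reproduced here.

```python
def morans_i(positions, values, max_dist):
    """Classical Moran's I with weight 1 for pairs within max_dist bp, else 0.

    positions: coordinates (bp) of the substitutions (hypothetical input).
    values: indicator per substitution, e.g. 1 for one type and 0 for another.
    max_dist: distance threshold in bp defining neighbouring substitutions.
    """
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    denom = sum(d * d for d in dev)
    if denom == 0:
        raise ValueError("values are constant; Moran's I is undefined")

    num = 0.0
    w_total = 0
    # Sum cross-products of deviations over all ordered neighbouring pairs.
    for i in range(n):
        for j in range(n):
            if i != j and abs(positions[i] - positions[j]) <= max_dist:
                num += dev[i] * dev[j]
                w_total += 1
    if w_total == 0:
        raise ValueError("no pair of substitutions lies within max_dist")

    return (n / w_total) * (num / denom)
```

Because the neighbourhood is defined by `max_dist`, the same data can yield different values at 40, 400, or 500 bp, which is why comparing observed and simulated statistics is most informative when both use the same threshold.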

https://doi.org/10.24072/pci.genomics.100088.rev12

Reviewed by David Castellano, 01 Jul 2021

In this manuscript, Galtier quantifies the strength of GC-biased gene conversion (gBGC) and its impact on protein evolution in 4 families and 32 species of mammals. He finds a substantial impact of gBGC on AT > GC synonymous substitutions (explaining ~60% of the variance). I've divided this revision into 4 sections.

1. Is the science sound, with a logical narrative and well-supported results and conclusions?

The manuscript follows a logical narrative and the methods are sound. The literature context provided in the introduction is very helpful. However, there is a key question regarding the interpretation of an important result that should be addressed before recommendation: I agree that if recombination hotspots are more ephemeral in Hominidae than in other groups, then this could explain the weak clustering of WS substitutions within genes. However, there is an alternative hypothesis. Could the weak clustering of WS substitutions within Hominidae genes be due to their lower diversity? The more distant the segregating sites are, the less likely it would be for gBGC to generate a cluster of substitutions in a given gene. Is there a correlation between heterozygosity at the gene level and Moran’s I for WS substitutions? Hence, maybe the substitution clustering within genes is conditional on B intensity + gene heterozygosity + (local & global) Ne. I am also assuming that the gene conversion tract length is not negatively correlated with Ne. I am not sure there is literature regarding the correlation between gene conversion tract lengths and Ne.

Ideally, genes' heterozygosity and Ne should be decoupled to assess this hypothesis (by comparing genes with different mutation rates within a genome?), which is hard. But maybe this alternative can be further discussed or elaborated by the author.

2. Is there enough info to allow verifying and reproducing the data?

The supplementary information, plus the scripts, are easy to access and rerun.

3. Are there obscure passages that a potential reader can’t go through?

So far the paper is easy to follow, but of course, there are always things that can be clarified. For example:

3.1 It would be good to have a table (in the main text?) with all five models (M1, M2, M3z, M3h, and M3sh), their number of parameters, and the average lnL across species. I cannot find in supplementary table 1 the info regarding model M3h and the p-values of the LRTs commented on in the main text.

3.2 I don't quite understand model M3sh. If q (is q equivalent to the number of hotspots per gene?) approaches zero, does this mean that there are no hotspots within genes, or that hotspots occur in a very tiny fraction of the gene? Maybe the definition of model M3sh can be extended or rephrased?

3.3 Lines 135-136. "These were very similar to estimates obtained by averaging B̄ across the M1, M2, M3z, M3sh, and M3h models, weighting by the AIC of each model." Would it be possible to add the AIC weights to supplementary table 1 too, and to have a supplementary figure equivalent to figure 3 but with the AIC-weighted parameters? Just to back up this sentence with figures and tables.
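For readers unfamiliar with the weighting mentioned in comment 3.3, the following is a minimal sketch of model averaging with Akaike weights, assuming the standard definition in which each weight is proportional to exp(-ΔAIC/2). The manuscript's exact weighting scheme may differ, and the function names and the AIC and B̄ values below are hypothetical placeholders, not results from the study.

```python
import math


def akaike_weights(aic_values):
    """Return Akaike weights for a list of AIC scores (lower AIC is better)."""
    best = min(aic_values)
    # Relative likelihood of each model given its AIC difference to the best one.
    rel = [math.exp(-0.5 * (a - best)) for a in aic_values]
    total = sum(rel)
    return [r / total for r in rel]


def model_averaged(estimates, aic_values):
    """Average per-model estimates, weighting each by its Akaike weight."""
    weights = akaike_weights(aic_values)
    return sum(w * e for w, e in zip(weights, estimates))


# Hypothetical values for illustration only (not taken from the manuscript).
models = ["M1", "M2", "M3z", "M3sh", "M3h"]
aic = [1010.0, 1004.0, 1001.0, 1000.0, 1002.0]
b_bar = [0.8, 1.1, 1.3, 1.4, 1.2]

weights = akaike_weights(aic)
averaged_b = model_averaged(b_bar, aic)
for name, w in zip(models, weights):
    print(f"{name}: weight = {w:.3f}")
print(f"AIC-weighted estimate: {averaged_b:.3f}")
```

Reporting the per-model weights alongside the averaged estimate, as requested above, would let readers see which models dominate the average for each species.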

4. Potential extra analysis only if interesting enough to the recommender and/or author:

4.1 Regarding the across genes RSD analysis. Is the recombination map in Muridae also more uniform than in Bovidae, Hominidae, and Cercopithecidae? That could explain the results, but I understand that the recombination map for all these groups might not be available.

4.2 Again regarding the clustering of mutations within genes. Would it be possible to assess whether most WS clustering is happening at first exons (the ones closest to CpG islands)? As far as I know, at least in humans, recombination hotspots tend to occur at CpG islands at the start of genes.

4.3 Relative to the genome-wide excess of WS mutations due to gBGC. Would it be possible to estimate the deficit of SW mutations too? In other words, it would be interesting to know the overall impact of gBGC on substitution rate, taking into account that the absolute numbers of WS and SW substitutions might be different. Maybe controlling for GC-conservative substitutions across species?

https://doi.org/10.24072/pci.genomics.100088.rev13