How to best call the somatic mosaic tree?
Somatic mutation detection: a critical evaluation through simulations and reanalyses in oaks
Recommendation: posted 25 October 2022, validated 08 November 2022
Bierne, N. (2022) How to best call the somatic mosaic tree? Peer Community in Genomics, 100024. https://doi.org/10.24072/pci.genomics.100024
Any multicellular organism is a molecular mosaic, with somatic mutations accumulating between cell lineages. Large, long-lived trees have nourished the image of a somatic mosaic tree, both because of observations of spectacular phenotypic mosaics and because somatic mutations in plants can potentially be passed on to gametes (reviewed in Schoen and Schultz 2019). The falling cost of genome sequencing now offers the opportunity to tackle the issue and identify somatic mutations in trees.
However, characterizing this somatic mosaic from genome sequences turns out to be much more difficult than one might first think. What separates cell lineages ontogenetically, in number of cell divisions, or in time? How should clonal cell populations be sampled? How are somatic mutations distributed in the population of cells of an organ or an organ sample? Should they be fixed heterozygotes in the sample of cells sequenced, or polymorphic? Do we indeed expect somatic mutations to be fixed? How should we identify and count somatic mutations?
To date, the detection of somatic mutations has mostly been done with a single variant caller in a given study, and we have little perspective on whether different callers give similar or different results. Some studies have used standard SNP callers, which assume that a somatic mutation is fixed in the heterozygous state in the sample of cells, with an expected allele coverage ratio of 0.5. Fewer have used cancer callers, which are designed to detect mutations present in only a fraction of the cells in the sample. However, standard SNP callers can also detect mutations that deviate from a balanced allelic coverage, and different cancer callers have different characteristics that should affect their outcomes.
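The contrast between a fixed heterozygous mutation (expected allele coverage ratio of 0.5) and a mutation carried by only a fraction of the cells can be illustrated with a small calculation. The sketch below is not how any of the callers discussed here works internally; it only computes the observed allelic fraction at a site and an exact two-sided binomial test against the 0.5 expectation. The function names and the read counts in the usage note are hypothetical, and the test ignores sequencing errors and mapping bias.

```python
from math import comb


def allelic_fraction(alt_reads, ref_reads):
    """Fraction of reads at a site that carry the alternate allele."""
    total = alt_reads + ref_reads
    if total == 0:
        raise ValueError("no reads cover this site")
    return alt_reads / total


def binomial_two_sided_p(alt_reads, depth, expected=0.5):
    """Exact two-sided binomial test of the alternate read count.

    Sums the probabilities of all outcomes that are no more likely than
    the observed one under Binomial(depth, expected). Intended for modest
    sequencing depths only (very large depths overflow float arithmetic).
    """
    pmf = [
        comb(depth, k) * expected**k * (1 - expected) ** (depth - k)
        for k in range(depth + 1)
    ]
    observed = pmf[alt_reads]
    # Small tolerance so that ties are not lost to floating-point rounding.
    total = sum(p for p in pmf if p <= observed * (1 + 1e-9))
    return min(1.0, total)
```

For instance, a hypothetical site with 5 alternate reads out of 50 has an allelic fraction of 0.1, and `binomial_two_sided_p(5, 50)` returns a very small value: such a site is poorly described by the fixed-heterozygote expectation, which is the kind of mutation cancer callers are designed to handle.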
In order to tackle these issues, Schmitt et al. (2022) conducted an extensive simulation analysis to compare different variant callers. Then, they reanalyzed two large published datasets on pedunculate oak, Quercus robur. The analysis of in silico somatic mutations allowed the authors to evaluate the performance of different variant callers as a function of the allelic fraction of somatic mutations and the sequencing depth. They found one of the seven callers to provide better and more robust calls for a broad set of allelic fractions and sequencing depths. The reanalysis of published datasets in oaks with the most effective cancer caller of the in silico analysis allowed them to identify numerous low-frequency mutations that were missed in the original studies.
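Why allelic fraction and sequencing depth matter together can be seen from a back-of-the-envelope model. The sketch below assumes that the number of reads carrying a mutation follows a binomial distribution, and that a site is only "callable" if at least `min_alt_reads` reads support it. Both the function name and the threshold are assumptions made for illustration; they are not taken from Schmitt et al. (2022) or from any specific caller, and the model ignores sequencing errors, mapping, and caller-specific filters.

```python
from math import comb


def detection_probability(depth, allelic_fraction, min_alt_reads=3):
    """P(at least `min_alt_reads` alternate reads) under Binomial(depth, f).

    `min_alt_reads` is a hypothetical evidence threshold, not a parameter
    of any real variant caller.
    """
    if not 0.0 <= allelic_fraction <= 1.0:
        raise ValueError("allelic_fraction must be between 0 and 1")
    f = allelic_fraction
    # Probability of seeing fewer than `min_alt_reads` alternate reads.
    missed = sum(
        comb(depth, k) * f**k * (1 - f) ** (depth - k)
        for k in range(min(min_alt_reads, depth + 1))
    )
    return 1.0 - missed


if __name__ == "__main__":
    # Illustrative grid of depths and allelic fractions (arbitrary values).
    for depth in (25, 50, 100, 200):
        row = [
            f"{detection_probability(depth, f):.2f}"
            for f in (0.01, 0.05, 0.1, 0.25, 0.5)
        ]
        print(depth, row)
```

Under this simplified model, the chance of gathering enough supporting reads drops quickly for low allelic fractions unless depth is high, which is one reason to evaluate callers across a grid of both quantities for a given study design.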
I recommend the study of Schmitt et al. (2022) first because it shows the benefit of using cancer callers in the study of somatic mutations, whatever the allelic fraction you are ultimately interested in. You can select fixed heterozygotes if this is your ultimate target, but cancer callers additionally give you a valuable overview of the allelic fractions of somatic mutations in your sample, and most perform as well as SNP callers for fixed heterozygous mutations. In addition, Schmitt et al. (2022) provide the pipelines needed to investigate in silico data matching a given study design, which encourages researchers to compare different variant callers rather than arbitrarily going with only one. We can anticipate that the study of somatic mutations in non-model species will attract increasing attention now that multiple tissues of the same individual can be sequenced at low cost, and the study of Schmitt et al. (2022) paves the way for questioning and choosing the best variant caller for the question one wants to address.
Schoen DJ, Schultz ST (2019) Somatic Mutation and Evolution in Plants. Annual Review of Ecology, Evolution, and Systematics, 50, 49–73. https://doi.org/10.1146/annurev-ecolsys-110218-024955
Schmitt S, Leroy T, Heuertz M, Tysklind N (2022) Somatic mutation detection: a critical evaluation through simulations and reanalyses in oaks. bioRxiv, 2021.10.11.462798. ver. 4 peer-reviewed and recommended by Peer Community in Genomics. https://doi.org/10.1101/2021.10.11.462798
The recommender in charge of the evaluation of the article and the reviewers declared that they have no conflict of interest (as defined in the code of conduct of PCI) with the authors or with the content of the article. The authors declared that they comply with the PCI rule of having no financial conflicts of interest in relation to the content of the article.
Reviewed by anonymous reviewer 1, 12 Sep 2022
Reviewed by anonymous reviewer 2, 23 Aug 2022
Evaluation round #1
DOI or URL of the preprint: https://www.biorxiv.org/content/10.1101/2021.10.11.462798v2
Author's Reply, 17 Aug 2022
Decision by Nicolas Bierne, posted 07 Jul 2022
Dear Dr. Schmitt,
I have received two thoughtful reviews of your preprint entitled “Somatic mutation detection: a critical evaluation through simulations and reanalyses in oaks”. The referees are globally positive and expressed mainly minor concerns, which are nonetheless important to address. I would ask you to address all of these concerns in a revised version.
Basically, both referees think that you interpreted as performance differences between callers what are in fact differences in purpose. If the objective is to identify somatic mutations present in a low proportion of the cells analyzed, it is clear that cancer callers are designed to do that while generic/SNP callers are not (they are rather designed to assign low-VAF mutations to sequencing errors). So one part of the issue is not about some callers outperforming others but rather about what we want to infer with these callers, and this should be made clearer. Some callers (e.g. Octopus) have different models depending on whether the user wants to call germline or somatic mutations, or depending on the ploidy, and you can even analyse pool-seq or low-coverage data. You therefore have to make clearer that you are interested in datasets that consist of multiple tissues sampled from the same tree in order to detect somatic mutations, and that this is different from standard SNP calling and also different from standard cancer genomics experiments (referee 2 even suggested that ideally a caller should be designed or tuned for this type of data). Then there is the study of the performance of cancer callers in detecting somatic mutations (under your given study design), which is very interesting to investigate. However, you should make clearer how you evaluated robustness, in order to be sure that your reasoning is not circular (referee 1). You should clarify how sequencing errors and true low-frequency somatic mutations can be told apart, so that you can claim that Strelka2 outperforms other callers on real data. Is it simply that the more mutations, the better, or is it more subtle? You should also explore or discuss the effect of default parameters on the differences between callers (referee 2). At any rate, it would be better framing to argue that Strelka2 is better suited to the purpose and the data than to say that it outperforms other callers.
Mapping is also an important phase of the pipeline, and it could be interesting to discuss the fact that some studies used Bowtie and others BWA (referee 1).
Thank you for submitting your preprint to PCI Genomics for peer review. I look forward to reading your revised version.