Schiffer et al. present a genome assembly, annotation, and comparative analysis of a representative of Xenoturbellida, perhaps the most evolutionarily interesting (and controversy-hounded) lineage of Bilateria, owing to its relatively simple gross morphology and uncertain phylogenetic position. They demonstrate a reasonably gene-space complete primary assembly and annotation of this small genome, and, using HiC libraries, scaffold roughly 3/5ths of the assembly into 18 linkage groups showing high levels of macrosynteny conservation with non-bilaterians and a representative deuterostome. A comprehensive orthology analysis shows, perhaps surprisingly regardless of the phylogenetic position of this lineage, a relatively bilaterian-like gene content from the perspective of orthologue occupancy and signaling/transcription factor machinery, albeit with a slightly higher than average loss of ancestral bilaterian orthologs (on par, I was surprised to see, with Hofstenia miamia, a representative of the acoelomorph sister lineage). A gene presence/absence phylogeny made with a different orthology declaration method and a reduced taxon set shows strong support for a Xenambulacraria topology, even while splitting Xenacoelomorpha. Such an analysis is only possible with the whole-genome data presented here, and represents a refreshingly different and still somewhat novel approach to this difficult phylogenetic problem compared with the familiar sequence-alignment-based inference methods, about which much has been published elsewhere. Numerous other "small" analyses (which I'm sure represent months of work in many cases), e.g. of miRNA content, neuropeptide complement, homeobox gene organization, phylostratigraphy, and symbiont genomics, are presented, which shed light on many aspects of xenoturbellidan biology - doubtless this manuscript will help solidify our understanding of this enigmatic lineage, and stimulate deeper study in some unexpected areas.
The phylostratigraphically anomalous & sparsely methylated chromosome is particularly interesting.
There are a few apparent weaknesses of the manuscript. It's evident that these data were generated some time ago, and that the technologies used to generate the primary assembly are now basically obsolete - I'm sure that 1/3 of a Sequel II SMRT Cell or a single MinION flow cell could generate a much more contiguous (and probably somewhat more complete) assembly of this genome with far fewer bioinformatic acrobatics these days. That said, I think the authors demonstrate convincingly that for the specific analyses shown in this paper, focusing on coding gene content and a bird's-eye view of macrosynteny conservation, this assembly is adequate to the task at hand, and a reviewer shouldn't ask for more than that. Even so, I would not present this as a "high quality genome" by today's standards - it is fundamentally a highly scaffolded Illumina genome which was just about contiguous enough to further scaffold to a pseudochromosomal stage with HiC data.
Obviously, one of the major uses of the genome will be in providing new evidence for the phylogenetic position of this lineage. I think the strength of the gene presence/absence phylogeny based on whole genomes assigned to OMA orthogroups speaks for itself, and I have no particular qualm with the authors' methods or interpretation of these results. However, I did find it strange that this mode of phylogeny-building was not explored for the taxonomically much larger orthogroup assignment done using OrthoFinder. True, there can be failures to detect true gene presence in transcriptomes, and the acoel transcriptomes that exist vary quite a bit in quality, but the cynic in me did wonder whether such analyses were conducted and not presented because they yielded results incompatible with the authors' previous body of work on this phylogenetic problem. Some further justification of this decision, in any case, seems appropriate.
By far the most glaring problems with the manuscript are in its Methods section and overall transparency/reproducibility. Almost all of the primary data used to generate these results were not made available during review, so that even basic sanity checks, e.g. through a k-mer analysis of genome size & heterozygosity, were not possible. Numerous basic reports, e.g. on library quality and assembly statistics at various stages of the assembly pipeline, were not presented. Important analyses are alluded to but not shown (e.g. blobplots, de novo transcriptome assembly statistics/completeness). Several clear factual errors are apparent (e.g. in the instrument used to generate the core assembly), and where both lab and bioinformatic protocols are remarked on, they are often presented with such a low level of detail as to preclude reproducibility. Indeed, many data types which were used for various small analyses (e.g. bisulfite sequencing, ONT sequencing) are not mentioned at all in the methods or supplement. I've given a fairly detailed account of where I see the absences in the notes below. For the most part, I have confidence in the quality of the datasets used to underpin this work, which was doubtless a lot of labor over many years, done in the lab of a well-established research leader in this field. I also do realize that these are *lots* of different experiments, and some of the data types are now no longer even on the market (e.g. TSLR). However, all published scientific literature should hold itself to a basic standard of transparency and reproducibility, which I would say this manuscript in its current form does not meet.
Detailed notes on introduction:
It seems the authors have made a choice not to cite any of the early molecular work plagued by contamination with molluscan gut contents. Follow up note: have the authors themselves done any screening for molluscan DNA?
From line 70 - the authors refer to "a majority of studies" but cite only one (Cannon et al.) - perhaps other citations are needed here?
From line 87 "The loss of...": A bit of a strange review - I'm not sure a barnacle or urochordate or neodermatan morphologist would characterize their study systems as morphologically simple. And is neoteny really a "new mode of living"? - I think of it as a hypothetical model of evolutionary change. I would have less issue with a statement to the effect that major ecological transitions are often accompanied by major morphological shifts, including loss of "bodyplan" level features and organ systems.
Line 103 "The only Xenacoelomorpha genomes available...": this is now out of date, given the preprint on the Symsagittifera roscoffensis genome, albeit from a species very closely related to Praesagittifera. https://www.biorxiv.org/content/10.1101/2022.08.27.505549v1.full
Detailed notes on Results:
The difference in size between the primary assembly (121M) and the final assembly (117M) suggests that very little sequence was removed by redundans as haplotypic duplication - is this correct? Was the genome relatively homozygous, e.g. as judged by k-mer content?
I find the interpretation of false-negative orthology detection due to fast rates of sequence substitution leading to a splitting of Xenacoelomorpha in the p/a phylogeny quite credible, actually. There was an interesting paper recently published that looked at rates of false negative orthology detection and showed this to be a pervasive problem in taxonomic lineages that are poorly sampled and/or fast-evolving: https://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.3000862
I would be interested to know how many lineage-specific gene births and losses are recovered for Xenoturbella+Ambulacraria in this presence-absence analysis. Does the taxon-restricted gene set have any particular characteristics - e.g. average alignment length, compositional bias, distribution throughout the genome? If they are "perfectly normal" genes it strengthens the argument that this relationship is unlikely to be an artifact. I would be particularly interested in knowing if any of the "Xenambulacrarian" genes are particularly enriched on c1896, which was a very striking outcome of your analysis.
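The enrichment question for c1896 could be framed as a one-sided hypergeometric test. The following is a minimal sketch of that idea only: the function name is my own, and the counts in the example call are made-up placeholders, not values from the manuscript.

```python
from math import comb

def hypergeom_enrichment_p(total_genes, genes_on_chrom, set_size, set_on_chrom):
    """One-sided hypergeometric p-value, P(X >= set_on_chrom).

    total_genes    : all annotated genes in the genome
    genes_on_chrom : genes located on the chromosome of interest
    set_size       : size of the gene set being tested (e.g. clade-restricted genes)
    set_on_chrom   : members of that gene set located on the chromosome
    """
    upper = min(genes_on_chrom, set_size)
    denom = comb(total_genes, set_size)
    tail = sum(
        comb(genes_on_chrom, k) * comb(total_genes - genes_on_chrom, set_size - k)
        for k in range(set_on_chrom, upper + 1)
    )
    return tail / denom

# Hypothetical placeholder counts, purely to show the call signature.
p_value = hypergeom_enrichment_p(
    total_genes=15000, genes_on_chrom=300, set_size=200, set_on_chrom=12
)
print(f"p = {p_value:.3g}")
```

A small p-value here would only say that the gene set is over-represented on that chromosome relative to a random draw of genes; it would not by itself distinguish between biological and annotation-driven explanations.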
Line 207: is Xenoturbella slow-evolving or fast-evolving? The title says one thing, here another. Perhaps a little more clarity on what precisely is meant by this is worthwhile.
A comment on BUSCO completeness. One of the authors (PHS) was kind enough to share the genome annotation and assembly with me for use in a classroom module, before I was asked to review this manuscript. By my own hand, the BUSCO completeness of the genome annotation against Metazoa odb10 in BUSCO5 (run in peptide mode) was:
Which is substantially lower than the 90% quoted in the manuscript - is this down to a difference in database versions or has something else happened?
Also, in comparison to the figures from the genome annotation (whether 82.5% or 90%), I have separately done some re-analyses of publicly available RNA-seq data from X. bocki generated by other groups. My Trinity assemblies give figures like:
To my mind, this indicates that some 40-50 metazoan BUSCOs are present in adult RNA-seq data but not represented in the genome annotation, potentially indicating some, although not a large, degree of incompleteness in the genome assembly.
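The comparison between annotation and transcriptome can be made explicit by intersecting BUSCO identifiers rather than comparing percentages. The sketch below assumes the usual BUSCO full-table layout (tab-separated, comment lines starting with "#", BUSCO id in the first column, status in the second, with statuses Complete, Duplicated, Fragmented, or Missing); the function names and the toy rows are mine and are not real results.

```python
def found_buscos(lines, accepted=("Complete", "Duplicated")):
    """Return the set of BUSCO ids whose status is in `accepted`."""
    found = set()
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header/comment lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 2 and fields[1] in accepted:
            found.add(fields[0])
    return found

def transcriptome_only(annotation_lines, transcriptome_lines):
    """BUSCO ids found in the transcriptome but not in the genome annotation."""
    return found_buscos(transcriptome_lines) - found_buscos(annotation_lines)

# Toy rows with invented ids, standing in for two full-table files.
toy_annotation = [
    "# Busco id\tStatus\tSequence\n",
    "idA\tComplete\tgene1\n",
    "idB\tMissing\n",
    "idC\tFragmented\tgene7\n",
]
toy_transcriptome = [
    "# Busco id\tStatus\tSequence\n",
    "idA\tComplete\ttx1\n",
    "idB\tDuplicated\ttx2\n",
    "idC\tComplete\ttx3\n",
]
missing_from_annotation = transcriptome_only(toy_annotation, toy_transcriptome)
print(sorted(missing_from_annotation))
```

In real use the two lists would be replaced by open file handles on the respective full tables; whether Fragmented hits should count as found is a judgment call exposed through the `accepted` parameter.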
Line 211 "derived bilaterians" - In my view & in the views of many other systematists, no living species is more or less derived than any other - individual characters can be derived, but not whole organisms. I have a similar problem with reference to "early branching lineages" made elsewhere in the manuscript - from the perspective of cnidarians, for instance, mammals are a representative of an early-branching lineage. I am sure that the authors won't disagree with these fundamental points, but the language that we sometimes use to describe these phenomena I think does bias our thinking towards a ladder-like view of evolution, and I would prefer to see these organismic comparisons made with less hypothesis-laden descriptors.
Very interesting that the bilaterian orthogroups have a similar occupancy for both Xenoturbella and the supposedly much faster evolving Hofstenia!
I am missing in Figure 3a data from the acoelomorphs incorporated into this analysis. Obviously you cannot show all 155 species in an easily readable way, but as a comparative point, it does seem essential to compare the Xenoturbella species available to the acoelomorphs, particularly if you are trying to argue that this genome is evolving especially faster or slower than those from representatives of its sister lineage.
For figures 3b/3c, are the trees at the bottom of the heatmaps dendrograms on the heatmap data, or are these schematic cladograms, and if so, from where do they originate?
I shall mostly refrain from commenting on the neuropeptide section as this is outside my expertise. I was curious, however - you see the K/R-RFP-K/R motif, which is reasonably compelling if anecdotal as a molecular synapomorphy, in Xenoturbella and in the ambulacrarians, but not in the nemertodermatid transcriptomes - is it also present in the reasonably complete transcriptome from X. profunda? Or any of the acoels?
On ancestral linkage groups. Indeed the conservation of macrosynteny visible with other metazoans is an impressive feature of this assembly, and a compelling demonstration of the success of this paper. I think it would help improve the readability of this manuscript if you could e.g. put a bold-line box around some of the clade-specific linkage groups you discuss in the Oxford plots. Also, I don't understand the argumentation around "prebilaterian" linkage groups. Neither the Nephrozoa hypothesis nor the Xenambulacraria hypothesis posits that Xenacoelomorpha are non-bilaterians - why would the absence of a eumetazoan plesiomorphy say anything decisive about either hypothesis?
On the anomalous small chromosome: super interesting result, and indeed perhaps the start of some really interesting Xenoturbella-specific biology. I wonder if this will be seen to occur in any acoelomorph genomes going forward. I would be very keen to see if gene tree topologies/delta-likelihoods of orthologs occurring on this chromosome are any different on average to those occurring elsewhere in the genome - for instance indicating a potential horizontal origin (albeit you don't see a large signature of this in the global analysis of HGT). One other thing: I couldn't really find anything in the figures corresponding to the synteny with the E. muelleri scaffolds described in the text - could you make this clearer? Indeed I don't see the c1896 labelled on the dot plot with E. muelleri.
Detailed notes on Discussion:
The discussion of "intermediate" genomic traits such as miRNA counts and linkage group organization feels a bit phenetic to me. Surely all of the traits mentioned could be analyzed cladistically, in a search for synapomorphies. Simply intermediate numbers of various character states shouldn't be compelling on their own.
Is the relatively canonical gene content of Xenoturbella informative either way on the Xenambulacraria vs Nephrozoa debate? I'm not sure I agree that it is. I think for instance of the Dimorphilus genome which was recently published, showing many features we would expect a typical annelid genome to have, despite the highly reduced body size and morphological simplification of this lineage. This to me shows a decoupling between a bird's-eye view of genome biology and morphology. So indeed, while the relatively bilaterian-gene-rich genome of Xenoturbella is consistent with the Xenambulacraria hypothesis, it's not *inconsistent* with the Nephrozoa hypothesis either. I do like your phrasing of the "strong" Nephrozoa hypothesis not being supported - this does imply, however, that a "weak" Nephrozoa hypothesis is possible (presumably meaning a Nephrozoa true tree topology but little obvious genomic "pre-bilaterian-ness" as one might naively think if one interprets xenacoelomorph morphology as primitively simple and gene content as predictive of morphology). Indeed, you say as much in the final paragraph of the discussion - I think this measured reading is appropriate and commendable.
Sentence beginning line 461: I had thought that the long branch lengths seen for acoels in your presence/absence trees would indicate a high rate of acoelomorph-specific gene births, rather than a high rate of loss? Is it possible to disentangle these?
Another possibility about the anomalous chromosome: could this be a germline restricted chromosome? We do expect that these should have younger genes on average, and these are also usually small. Does it show a different level of average coverage in the raw reads to other chromosomes?
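The coverage question could be answered with a very simple normalization of per-scaffold read depth. The sketch below assumes mean depths per scaffold have already been computed from a read alignment; the function name and the depth values are illustrative inventions, not measurements.

```python
from statistics import median

def relative_coverage(mean_depth_by_scaffold):
    """Divide each scaffold's mean read depth by the median across all scaffolds."""
    baseline = median(mean_depth_by_scaffold.values())
    if baseline <= 0:
        raise ValueError("median depth must be positive")
    return {name: depth / baseline for name, depth in mean_depth_by_scaffold.items()}

# Invented depths, only to show the shape of the input and output.
toy_depths = {"chr1": 60.0, "chr2": 62.0, "chr3": 58.0, "c1896": 30.0}
ratios = relative_coverage(toy_depths)
for name, ratio in ratios.items():
    print(f"{name}\t{ratio:.2f}")
```

A ratio well below 1 for c1896 in a library made from mostly somatic tissue would be consistent with, though certainly not proof of, a germline-restricted origin; a ratio near 1 would argue against it.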
Detailed notes on Methods:
Which phenol-chloroform protocols and Qiagen kits were used for DNA extraction? In the results it's asserted that HMW gDNA was extracted - how was this ascertained/QC'd?
It's concerning to me that the authors state a 2x250 bp read format was used on the HiSeq 4000, as this platform does not offer that read length. Perhaps it was a HiSeq 2500 2x250 rapid run?
It would be good to see one of the blobplots referred to, to convince the reader that this really is an uncontaminated genome assembly, despite the efforts to starve the specimens before extraction. This is a part of the tree of life for which few close references exist and it can be tricky sometimes to judge the source of contaminants from blobplots on such species - perhaps better to show rather than tell.
I am missing here some basic statistics (perhaps best shown in a table) on these assemblies during various points in the process, e.g. right after SPAdes, after redundans, after BADGER. The authors cite a scaffold N50 of 60 kb before HiC scaffolding, but what is the contig N50 before any scaffolding?
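For clarity on what I mean by contig versus scaffold N50: contigs can be recovered from a scaffolded assembly by splitting at runs of N, and N50 computed on both length sets at each pipeline stage. A minimal sketch follows, with hypothetical function names and toy sequences; the `min_gap` threshold is an assumption that would need to match how the scaffolder encodes gaps.

```python
import re

def n50(lengths):
    """Length L such that sequences of length >= L cover at least half the total."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    return 0  # empty input

def contig_lengths(scaffold_seqs, min_gap=1):
    """Split scaffolds at runs of at least `min_gap` N characters."""
    gap = re.compile(r"[Nn]{%d,}" % min_gap)
    return [len(piece) for seq in scaffold_seqs for piece in gap.split(seq) if piece]

# Toy scaffolds, not real sequence.
toy_scaffolds = ["ACGTNNNNACG", "AC"]
scaffold_n50 = n50([len(s) for s in toy_scaffolds])
contig_n50 = n50(contig_lengths(toy_scaffolds))
print(scaffold_n50, contig_n50)
```

Reporting both numbers after SPAdes, after redundans, and after each scaffolding step would let readers see how much of the final contiguity comes from scaffolding rather than from the underlying contigs.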
Indeed, the authors refer to mate-pair libraries but do not give any details on the protocols used to generate these datasets, the size of the mate pairs, QC statistics...
Redundans should be cited.
I was unable to find a link to the raw reads (except for two HiC datasets) or assemblies used in this paper. I was hoping to do some basic analyses, e.g. k-mer spectra, just to cross-check for instance that the assembled genome size matches the k-mer-estimated size, and to determine what proportion of k-mers in the reads were not represented in the assembly. Without such data (which could have been uploaded to SRA and embargoed for public release) it's difficult to fully review this manuscript. I will note that the GenomeScope analyses I made of the two HiC datasets were somewhat concerning, with no visible spectra outside an error distribution - perhaps these are low-diversity libraries, or highly contaminated libraries?
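To spell out the kind of cross-check I have in mind, here is a deliberately naive sketch of a k-mer spectrum and a peak-based genome size estimate. It is not what GenomeScope does (there is no model fitting and no handling of heterozygosity or reverse complements), the function names are my own, and in practice a dedicated k-mer counter would be used on the full read set.

```python
from collections import Counter

def kmer_spectrum(reads, k):
    """Map k-mer multiplicity to the number of distinct k-mers with that multiplicity.

    Simplification: forward-strand k-mers only, no canonical (reverse-complement) merging.
    """
    counts = Counter()
    for read in reads:
        read = read.upper()
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if "N" not in kmer:
                counts[kmer] += 1
    return Counter(counts.values())

def estimate_genome_size(spectrum, min_depth=2):
    """Total k-mer occurrences at depth >= min_depth, divided by the modal depth.

    `min_depth` is a crude cutoff for the low-multiplicity error k-mers.
    """
    usable = {depth: n for depth, n in spectrum.items() if depth >= min_depth}
    if not usable:
        raise ValueError("no k-mers above the error cutoff")
    peak_depth = max(usable, key=lambda depth: usable[depth])
    total_kmers = sum(depth * n for depth, n in usable.items())
    return total_kmers / peak_depth

# Toy input: one short sequence 'sequenced' ten times.
toy_reads = ["ACGTA"] * 10
toy_spectrum = kmer_spectrum(toy_reads, k=3)
toy_size = estimate_genome_size(toy_spectrum)
print(dict(toy_spectrum), toy_size)
```

If the spectrum of a real library shows no peak beyond the error distribution, as I saw for the two HiC datasets, an estimate of this kind is not meaningful, which is exactly why access to the shotgun reads matters.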
Some concerns about the HiC protocol and data presented here. Fixing a whole animal, rather than cryohomogenized tissue, is likely to lead to poor results from autolysis as the fixative penetrates large volumes of tissue. There's no indication that the DpnII enzyme was heat-killed before proceeding to fill-in. Numerous volumes used (for formaldehyde and SDS, for example) and enzyme details (which "ligase"? what manufacturer?) are missing. It's not clear what protocol was used to prepare the extracted 3C DNA into an Illumina library, or how the biotin selection was performed. And most concerning of all, I can't really see any QC data on this library - at the very least, the authors should be showing the pair-length distribution and the contact heatmaps which have become standard in the field, so that readers can judge how strong the evidence for the chromosome scaffolding is.
I think, as instaGRAAL is a published method, it's not necessary to explain its algorithm in detail here - just the parameters that were used to run it.
The protocols used for RNA extraction and cDNA library preparation should be specified in enough detail for another lab to reproduce this work. It would also be good to see some rough statistics on the Trinity assembly, so readers can judge its completeness and contiguity. Again, I could not find any RNA-seq reads used in this study uploaded to the SRA.
The authors refer to additional single-cell transcriptome data - if these were used in the annotation of this genome, surely the experiments used to derive these should also be described in the methods section and deposited into public databases?
Question: during setup for orthology inference, for the species for which RNA-seq data only were used as input, were *only* those genes with positive hits against UniProt/Pfam retained in the protein prediction, or was this simply used to improve the sensitivity of the predictions? I am wondering if this pipeline might exclude novel taxon-specific orthologs with no sequence similarity to existing databases.
I don't see any problem per se in the way that the gene presence/absence phylogenies were generated, but I am curious why the OMA algorithm, and apparently a separate species set, was employed for this when the OrthoFinder analysis should also in principle be well-suited to this kind of analysis. Do the results differ with a larger taxon sample?
In the homeobox section, ONT reads are mentioned for the first time. There's no information given in the manuscript about the volume and quality of these data, and how they were generated. I also find it strange that these were used only in the context of homeodomain-containing contig analysis - why not also incorporate them into the primary SPAdes assembly?
Line 751 "We extracted a highly contiguous..." - how was this extraction performed bioinformatically?
Line 752 - As I understand it, LINKS is a scaffolder, not a polisher.
BUSCO is mentioned - including re-analyses of public data such as the Hofstenia genome - but the parameters/database versions used to run this software seem not to have been reported.
Similarly: there are some results on methylation reported in the supplement, but no mention is made of how these results were obtained - was this bisulfite sequencing? If so, how were these libraries generated and these analyses performed?