A high quality reference genome of the brown hare
High quality genome assembly of the brown hare (Lepus europaeus) with chromosome-level scaffolding
Recommendation: posted 19 January 2024, validated 24 January 2024
Hollox, E. (2024) A high quality reference genome of the brown hare. Peer Community in Genomics, 100299. 10.24072/pci.genomics.100299
Recommendation
The brown hare, or European hare, Lepus europaeus, is a widespread mammal whose natural range spans western Eurasia. At the northern limit of its range, it hybridises with the mountain hare (L. timidus), and humans have introduced it into other continents. It represents a particularly interesting mammal to study for its population genetics, extensive hybridisation zones, and as an invasive species.
This study (Michell et al. 2024) has generated a high-quality assembly of a genome from a brown hare from Finland using long PacBio HiFi sequencing reads and Hi-C scaffolding. The contig N50 of this new genome is 43 Mb, and completeness, assessed using BUSCO, is 96.1%. The assembly comprises 23 autosomes, and an X chromosome and Y chromosome, with many chromosomes including telomeric repeats, indicating the high level of completeness of this assembly.
While the genome of the mountain hare has previously been assembled, its assembly was based on a short-read shotgun assembly, with the rabbit as a reference genome. The new high-quality brown hare genome assembly allows a direct comparison with the rabbit genome assembly. For example, the assembly addresses the karyotype difference between the hare (n=24) and the rabbit (n=22). Chromosomes 12 and 17 of the hare are equivalent to chromosome 1 of the rabbit, and chromosomes 13 and 16 of the hare are equivalent to chromosome 2 of the rabbit. The new assembly also provides a hare Y-chromosome, as the previous mountain hare genome was from a female.
This new genome assembly provides an important foundation for population genetics and evolutionary studies of lagomorphs.
References
Michell, C., Collins, J., Laine, P. K., Fekete, Z., Tapanainen, R., Wood, J. M. D., Goffart, S., Pohjoismäki, J. L. O. (2024). High quality genome assembly of the brown hare (Lepus europaeus) with chromosome-level scaffolding. bioRxiv, ver. 3 peer-reviewed and recommended by Peer Community in Genomics. https://doi.org/10.1101/2023.08.29.555262
The recommender in charge of the evaluation of the article and the reviewers declared that they have no conflict of interest (as defined in the code of conduct of PCI) with the authors or with the content of the article. The authors declared that they comply with the PCI rule of having no financial conflicts of interest in relation to the content of the article.
This study belongs to the xHARES consortium funded by the R'Life initiative of the Academy of Finland, grant no. 329264.
Evaluation round #1
DOI or URL of the preprint: https://doi.org/10.1101/2023.08.29.555262
Version of the preprint: 2
Author's Reply, 18 Jan 2024
First, we would like to thank both reviewers for their constructive feedback, which has enabled us to improve the manuscript substantially.
Reviewer 1
*I have annotated the PDF with a range of comments and suggested edits, hoping that this will be useful for the authors to clarify and strengthen a few sections, incl. removal of some wording that feels rather subjective in places.*
Re: Thank you. These points should now be covered.
Reviewer 2
*Introduction
Line 52: but it remains as native species (suggestion, but anything that helps the sentence flow better would be good)*
Re: Thank you, but a threat to the native species is meant here. We have now tried to make it more clear.
*Line 58: especially through the expansion...
Line 65: Are the mountain hare and the SA Cape hare the same species? If not, specify the species name of the mountain hare, as done for the SA Cape hare.
Extra note: Mountain hare is then mentioned again in line 86, with the species name specified there; it should instead be given at the first occurrence (line 65).
Line 73: Maybe link better the previous sentence with: Identifying Poland as.. at first, it is unclear why only that scenario for type locality is explained.
Line 111: There is an ongoing discussion regarding the continuation of utilization of the terms second or third generation sequencing, as technologies are developing at such fast pace and very different technologies in parallel. I would suggest initiating that sentence by just saying, the technological development of high-throughput sequencing of long molecules, … Same on line 118 (change to long read sequencing)
Line 114: early 2000’s is very generic, and not clear what you mean by that, I would say that through a big part of most early 2010’s it was still also mostly genome assemblies based on short reads. (For example, PacBio RS did not start being fully commercial until 2011. Oxford Nanopore’s MinION until 2015)
Line 114: Same comment as before, about specifying second generation, I would just say short read sequencing technologies. Short read data is still in constant development and is still needed and useful for a lot of studies (including Hi-C sequencing, mentioned after).
Line 115: Instead of similarly, I would say additionally, as it is yet another technology that can be used supporting the others.
Line 115: Propelled these genomes -> it would be more accurate to say that has propelled the research on generation of genome assemblies and reference genomes
Line 117: Weird phrasing: I would rephrase to “Coupled with these advances, the decrease in price per base pair through the years has made whole genome sequencing available to many laboratories and research groups”
Line 120: these two technologies: specify there which long read technology, as it is the first time mentioning it in the main text.
Line 133: Even if it is quite clear by context, this paragraph is important and I would clarify again “a male specimen of brown hare”. The previous paragraph talks about different species and genome assemblies, so I think it is good to specify again that the study is on brown hare.*
Re: These points are now corrected or edited. Many thanks.
*Line 137-146: It is good to give final results on the final paragraph of the introduction, but I think it is too detailed, and most things should be left for the actual results and discussion sections. For example, no need to specify each BUSCO score here.
Line 142: I believe you mean L90 and not N90 there.*
Re: A more creative final paragraph is offered, which hopefully ties up better with the rest of the introduction.
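The N90 versus L90 distinction raised for line 142 can be illustrated with a minimal sketch. It is not taken from the manuscript: the function name `nx_lx` and the toy lengths are hypothetical, chosen only to show that Nx is a sequence length while Lx is a count of sequences.

```python
def nx_lx(lengths, percent=50):
    """Return (Nx, Lx) for a list of contig or scaffold lengths.

    Nx is the length of the shortest sequence in the smallest set of
    longest sequences that together cover at least `percent` % of the
    total assembly length. Lx is the number of sequences in that set.
    """
    if not lengths:
        raise ValueError("lengths must not be empty")
    if not 0 < percent <= 100:
        raise ValueError("percent must be in (0, 100]")

    ordered = sorted(lengths, reverse=True)
    total = sum(ordered)
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        # Integer comparison avoids floating-point rounding at the threshold.
        if running * 100 >= percent * total:
            return length, count
    raise AssertionError("unreachable: the full set always covers the total")


# Hypothetical toy assembly (lengths in arbitrary units, total = 300).
toy_lengths = [80, 70, 50, 40, 30, 20, 10]
n50, l50 = nx_lx(toy_lengths, 50)
n90, l90 = nx_lx(toy_lengths, 90)
```

In this toy case N90 is a length (30) and L90 is a count (5), which is why the two statistics cannot be used interchangeably in a sentence.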
*Material and Methods
Overall, revise proper referencing of machines, protocols, and materials used
Line 189: Is there no reference to that protocol?
Line 190: reference properly the machines used, Qubit should also have information about the company and country in parenthesis (ThermoFisher, Country).
Line 191: The DNA was not analyzed, it was sequenced.
Line 185-193: Who did the library prep? Which library prep was used? Add information on this, as it is key information in this paper.
Line 195, section Mitochondrial DNA. Just out of curiosity, was the mitochondrial genome not found among the PacBio raw data? Even if not assembled, it is usually possible to find circularized reads that perfectly cover it. It could be a good way to test that the separate method gives a similar result.
Line 264: Specify company or reference information on NEBNext Ultra…
Line 267: Ampure beads, again, company missing.
Line 274: it is not clear to me why those arguments are necessary to integrate the Hi-C data, if that is done after, separately from that first genome assembly. Could you describe more clearly that part of the methods?
Line 280: What was used to sort and deduplicate?
Line 298: there is no explanation on how the RNA sequences were assembled. In line 134 you mention that RNA was also extracted, but then there are no methods on how it was sequenced, and then assembled, as mentioned later in the genome annotation section. If it has been published in a previous publication, there is also no reference to that.
Line 316: Does Geneious have a reference? If not, the official website can be an alternative to that.*
Re: These should be now all covered.
*Line 320: I am aware that manual curation is very difficult to describe, and it ends up becoming a black box in the methods, but it would be nice to know what level of manual curation was done. Fixing a couple of wrongly annotated genes is not the same as rearranging contigs and scaffolds. It would be nice to give an approximate description of how much manual work was done on this particular genome assembly. I saw afterwards that lines 352-353 describe it in more detail; I would appreciate something like that in the methods, simply specifying the types of curation that were made.*
Re: Modified as suggested.
*Results
Line 332-334: I would add this at the end or beginning of the material and methods section. But this brings up another problem: in the methods, only the primary assembly has been described, and nothing about an alternative assembly. The methods should include all information about how the results have been produced!*
Re: HiFiasm is a haplotype-aware assembler. It counts the k-mers and assembles the unique ones, then goes through a few rounds of purging. In these rounds it uses k-mer coverage to decide that two parts of the genome are the same: where they show the coverage expected of a heterozygous region, it collapses them into a single sequence, which becomes the primary, while the other becomes the alternative. The Hi-C data is then added to phase variants and identify connected regions. The output can therefore comprise a primary and an alternative assembly, as the alleles are separated out. However, we cannot know which allele comes from which parent without having the parental information too. The assembler creates a primary assembly from the most complete haplotype, with the other forming the alternative. The website included here (https://lh3.github.io/2021/04/17/concepts-in-phased-assemblies) provides good definitions of phased assembly concepts. We have now tried to explain this also in the methods.
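As a rough illustration of the k-mer coverage idea in this reply, the sketch below counts k-mers in two toy sequences that differ at a single site and tabulates how many distinct k-mers occur at each multiplicity. It is not hifiasm's implementation. The function names, the sequences and k = 3 are hypothetical, and reverse complements are ignored for simplicity.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer (forward strand only) across a list of reads."""
    counts = Counter()
    for read in reads:
        for start in range(len(read) - k + 1):
            counts[read[start:start + k]] += 1
    return counts


def kmer_spectrum(counts):
    """Map each multiplicity to the number of distinct k-mers seen that often."""
    return Counter(counts.values())


# Two hypothetical haplotype sequences differing at one position (A vs T).
haplotype_reads = ["ACGTAC", "ACGTTC"]
counts = count_kmers(haplotype_reads, k=3)
spectrum = kmer_spectrum(counts)
```

The k-mers shared by both sequences (ACG, CGT) are seen twice, whereas those overlapping the variant site are seen once. That half-coverage signal is the kind of evidence a haplotype-aware assembler can use to recognise heterozygous regions.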
*Line 338: estimated, not published
Line 341: using a k-mer size of 21 (no parenthesis needed). Maybe the sentence as a whole needs rephrasing.
Line 347: the largest scaffold is the same as before, so I would phrase it better to refer to that back. Right now, it seems like it is different.
Lines 337-351: Be a bit clearer on the differences between the two assemblies. Right now, it is just a lot of sentences giving numbers, but I am missing some clear distinction and similarities between them. I do not mean as a discussion, but more to make it clear to the reader. When the Hi-C scaffold assembly starts being described, again all the numbers are mentioned and it is hard to remember what was the same or what has changed from the previous. I would revise the entire paragraph to ease the reading, while still maintaining all the important numbers and data.*
Re: These points should be now covered, thank you.
*Line 352: Hi-C maps is not mentioned in material and methods, and it is not referenced in this first time that comes up.*
Re: This is now referenced and elaborated, thank you.
*Line 356: 93.16% what? Contigs? Scaffolds? Total Mb?*
Re: This is mentioned in the methods but has now been clarified.
*Figure 2: A -> Genomescope2 is not mentioned in material and methods nor properly referenced. B -> What was used to create the Hi-C map? C -> even if explained in methods, and noted on the axis, add explanation of what is the previous assembly.*
Re: Apologies for the oversight, Genomescope2 is now referenced in the methods. YaHS was used to produce the pretext map and is now mentioned. Figure 2C modified and an explanation added.
*Line 370: You keep referring to the previous genome assembly, but not clarifying that it is not the same hare species.*
Re: Thank you for pointing these out. We have been simultaneously working on the new mountain hare assembly, which explains the confusion.
*Line 368-381: Has anyone checked if there are multicopy genes in the repetitive element libraries? It is something worth doing, as you might be hiding important host genes in the masked section of the genome assembly.*
Re: It is a valid point, applying to all reference genome projects. We have had a quick look and all the gene sequences seem to be from retroelements or endoretroviruses. Unfortunately, we cannot address this issue in more detail now and can only point it out in the discussion as a potential trade-off.
*Line 385: refrain from using “genome”, use genome assembly instead.
Line 389: there are
Line 395: in the assembly*
Re: Corrected, thank you.
*Line 395: Why only discuss a fusion in the rabbit and not a chromosomal split in the hare?*
Re: Sentence modified so that causality is not implied.
*Line 397: again, genome, and then reference genomes. Specify as genome assemblies.*
Re: Corrected, thank you.
*Line 402 answers my previous question on the mitochondrial data. But I am surprised that no comparison with the actual sequence from HiFi is done. It could be a great way to test if your sequence is correct.*
Re: We agree that it is a bit incoherent for the rest of the story, but we had the unpublished mtDNA sequence already before.
*Discussion
Too many times throughout the discussion to note them down: change genome to genome ASSEMBLY.*
Re: Corrected, thank you.
*Line 439: If my comment above is to be investigated in future studies (possibility of host genes in the repeat libraries), mention in the discussion that possibility, so that it is clear that it could be done in the future. Right now, I would either suggest a major revision on that, or being very clear about being aware of the possible problems that might have caused that higher level of repeats in your genome assembly.*
Re: As pointed out, this would then be a universal issue with the masking of the repeats. We have now added a note into the discussion regarding the issue.
*Line 463-467: consider rephrasing those sentences, it is a bit confusing as it is now, with very long sentences and not good flow.
Line 467: Heteroplasmy already commonly denotes more than one mitochondrial genome variants, with no specification of two or more. Instead of saying “multiple heteroplasmy”, it would be better to add the clarification of what you mean by that after mentioning heteroplasmy.*
Re: These sections have now been rephrased or removed.
*Line 470-482: It would be nice to add some references here. Including methods and the mentioned law. The discussion is a bit weirdly structured. It starts with a very short paragraph of the actual results here presented, followed by a long explanation on the mtDNA, and a big section on DNA obtention for these kinds of results. Meanwhile, during the results, there were many parts that could be considered discussion, with references added to what was being presented. I would suggest moving some of that information to the discussion, to properly discuss the results that are presented in the actual study.*
Re: References added. We have now made an attempt to improve the discussion as suggested. Many thanks again for the constructive feedback!
Decision by Ed Hollox, posted 07 Dec 2023, validated 09 Dec 2023
The authors should address the suggestions and recommendations made by the reviewers.
Reviewed by anonymous reviewer 1, 07 Dec 2023
The authors present a novel de novo assembly of the brown hare Lepus europaeus. The work appears solid and I cannot see any major flaw. The presentation and writing are mostly clear, although in part a bit lengthy - it will be up to the final journal to decide if some of the more extraneous information (e.g. on history, cultural importance of hares, and high-quality DNA sampling) should remain in the final paper.
I have annotated the PDF with a range of comments and suggested edits, hoping that this will be useful for the authors to clarify and strengthen a few sections, incl. removal of some wording that feels rather subjective in places.