Recommendation

A pipeline to select SARS-CoV-2 sequences for reliable phylodynamic analyses

Emmanuelle Lerat based on reviews by Gabriel Wallau and Bastien Boussau

A recommendation of:

COVFlow: phylodynamics analyses of viruses from selected SARS-CoV-2 genome sequences

Gonché Danesh, Corentin Boennec, Laura Verdurme, Mathilde Roussel, Sabine Trombert-Paolantoni, Benoit Visseaux, Stephanie Haim-Boukobza, Samuel Alizon (2023), bioRxiv, ver.4, peer-reviewed and recommended by PCI Genomics https://doi.org/10.1101/2022.06.17.496544

Now published in Peer Community Journal

Abstract

COVFlow: phylodynamics analyses of viruses from selected SARS-CoV-2 genome sequences

Phylodynamic analyses generate important and timely data to optimise public health response to SARS-CoV-2 outbreaks and epidemics. However, their implementation is hampered by the massive amount of sequence data and the difficulty to parameterise dedicated software packages. We introduce the COVFlow pipeline, accessible at https://gitlab.in2p3.fr/ete/CoV-flow, which allows a user to select sequences from the Global Initiative on Sharing Avian Influenza Data (GISAID) database according to user-specified criteria, to perform basic phylogenetic analyses, and to produce an XML file to be run in the Beast2 software package. We illustrate the potential of this tool by studying two sets of sequences from the Delta variant in two French regions. This pipeline can facilitate the use of virus sequence data at the local level, for instance, to track the dynamics of a particular lineage or variant in a region of interest.

COVID-19, molecular epidemiology, sequence database, phylogenetics, public health

Submission: posted 12 December 2022, validated 13 December 2022
Recommendation: posted 07 September 2023, validated 11 September 2023

Cite this recommendation as:
Lerat, E. (2023) A pipeline to select SARS-CoV-2 sequences for reliable phylodynamic analyses. Peer Community in Genomics, 100239. https://doi.org/10.24072/pci.genomics.100239

Recommendation

Phylodynamic approaches enable viral genetic variation to be tracked over time, providing insight into pathogen phylogenetic relationships and epidemiological dynamics. These are important methods for monitoring viral spread, and identifying important parameters such as transmission rate, geographic origin and duration of infection [1]. This knowledge makes it possible to adjust public health measures in real-time and was important in the case of the COVID-19 pandemic [2]. However, these approaches can be complicated to use when combining a very large number of sequences. This was particularly true during the COVID-19 pandemic, when sequencing data representing millions of entire viral genomes was generated, with associated metadata enabling their precise identification.

Danesh et al. [3] present a bioinformatics pipeline, COVFlow, for selecting relevant sequences according to user-defined criteria and producing files that can be used directly for phylodynamic analyses. Sequence selection first applies a quality filter on sequence length and on the absence of unresolved bases, after which choices can be made based on the associated metadata. Once the sequences are selected, they are aligned and a time-scaled phylogenetic tree is inferred. Finally, an output file in a format directly usable by BEAST 2 [4] is generated.
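The quality filter described above can be illustrated with a minimal Python sketch. It is not the COVFlow implementation: the function names and the thresholds (a minimum length of 29,000 bases and at most 5% unresolved bases) are assumptions chosen for illustration only.

```python
def passes_quality_filter(seq, min_length=29000, max_n_fraction=0.05):
    """Return True if a sequence is long enough and has few unresolved bases.

    Both thresholds are hypothetical defaults, not values taken from COVFlow.
    """
    seq = seq.upper()
    if not seq or len(seq) < min_length:
        return False
    return seq.count("N") / len(seq) <= max_n_fraction


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_records(lines, **thresholds):
    """Keep only the FASTA records that pass the quality filter."""
    return [(h, s) for h, s in read_fasta(lines) if passes_quality_filter(s, **thresholds)]
```

With toy thresholds, `filter_records([">a", "ACGT", ">b", "ACNN"], min_length=4, max_n_fraction=0.25)` keeps only record `a`, because half of record `b` is unresolved.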

To illustrate the use of the pipeline, Danesh et al. [3] present an analysis of the Delta variant in two regions of France. They observed a delay in the start of the epidemic depending on the region. In addition, they identified genetic variation linked to the start of the school year and the extension of vaccination, as well as the arrival of a new variant. This tool will be of major interest to researchers analysing SARS-CoV-2 sequencing data, and a number of future developments are planned by the authors.

References

[1] Baele G, Dellicour S, Suchard MA, Lemey P, Vrancken B. 2018. Recent advances in computational phylodynamics. Curr Opin Virol. 31:24-32. https://doi.org/10.1016/j.coviro.2018.08.009

[2] Attwood SW, Hill SC, Aanensen DM, Connor TR, Pybus OG. 2022. Phylogenetic and phylodynamic approaches to understanding and combating the early SARS-CoV-2 pandemic. Nat Rev Genet. 23:547-562. https://doi.org/10.1038/s41576-022-00483-8

[3] Danesh G, Boennec C, Verdurme L, Roussel M, Trombert-Paolantoni S, Visseaux B, Haim-Boukobza S, Alizon S. 2023. COVFlow: phylodynamics analyses of viruses from selected SARS-CoV-2 genome sequences. bioRxiv, ver. 7 peer-reviewed and recommended by Peer Community in Genomics. https://doi.org/10.1101/2022.06.17.496544

[4] Bouckaert R, Heled J, Kühnert D, Vaughan T, Wu C-H et al. 2014. BEAST 2: a software platform for Bayesian evolutionary analysis. PLoS Comput Biol 10: e1003537. https://doi.org/10.1371/journal.pcbi.1003537

Conflict of interest:
The recommender in charge of the evaluation of the article and the reviewers declared that they have no conflict of interest (as defined in the code of conduct of PCI) with the authors or with the content of the article. The authors declared that they comply with the PCI rule of having no financial conflicts of interest in relation to the content of the article.

Funding:
This project was supported by the Agence Nationale de la Recherche Maladies Infectieuses Émergentes to the MODVAR project (grant no. ANRS0151).

Reviews

Evaluation round #3

DOI or URL of the preprint: https://doi.org/10.1101/2022.06.17.496544

Version of the preprint: 3

Author's Reply, 06 Sep 2023

Dear Dr. Lerat,

As suggested the title has been changed to "COVFlow: phylodynamics analyses of viruses from selected SARS-CoV-2 genome sequences".

Best,

-Gonché Danesh and co-authors

https://doi.org/10.24072/pci.genomics.100239.ar3

Decision by Emmanuelle Lerat, posted 04 Sep 2023, validated 05 Sep 2023

Dear Dr. Danesh,

Thank you for submitting a revised version of your manuscript.

Before I can make my final recommendation, could you slightly change the title of your manuscript?

Could you use one of the following alternatives: “COVFlow: viral phylodynamics analyses from selected SARS-CoV-2 genome sequences” or “COVFlow: phylodynamics analyses of viruses from selected SARS-CoV-2 genome sequences”? The second one may be better, given the increasingly non-biological use of the term “viral” in recent years; however, both are suitable.

Sincerely,

Emmanuelle Lerat

https://doi.org/10.24072/pci.genomics.100239.d3

Evaluation round #2

DOI or URL of the preprint: https://doi.org/10.1101/2022.06.17.496544

Version of the preprint: 2

Author's Reply, 20 Jul 2023

https://doi.org/10.24072/pci.genomics.100239.ar2

Decision by Emmanuelle Lerat, posted 13 Jul 2023, validated 13 Jul 2023

Dear Dr Gonché,

Thank you for the revised version of your manuscript. Only a few minor points remain to be addressed before I can recommend your article. Please consider the reviewer's suggestions carefully.

Sincerely

Emmanuelle Lerat

https://doi.org/10.24072/pci.genomics.100239.d2

Reviewed by Gabriel Wallau, 08 Jul 2023

Danesh and collaborators revised the manuscript, adding and adjusting parameters of the COVFlow pipeline, clarifying some sections, and highlighting some of the limitations of the pipeline configurations and analysis, along with more robust analyses. Therefore, I recommend its acceptance after the minor edits below.

More specifically, the authors improved the results section regarding the comparison with Nextstrain. I recommend that the authors include the information present in one of their answers: "For example, it can select data if a column contains a certain word, allowing the user to filter data that may contain spelling mistakes or to select data from a group of laboratories that contain a common word (in our case CERBA) but don’t have the same names". Stating only "COVflow allows a more flexible filtering stage using the JSON file" (page 8, line 169) does not make this clear.
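The "column contains a certain word" behaviour quoted above could look like the following Python sketch. It illustrates the idea only and is not COVFlow's JSON-driven filter: the function name, the column name `submitting_lab` and the laboratory names are invented placeholders.

```python
import csv
import io


def rows_containing(rows, column, word):
    """Keep rows whose value in `column` contains `word`, ignoring case."""
    needle = word.casefold()
    return [row for row in rows if needle in (row.get(column) or "").casefold()]


# Invented metadata; column and laboratory names are placeholders.
metadata_tsv = (
    "strain\tsubmitting_lab\n"
    "s1\tLaboratoire CERBA\n"
    "s2\tCerba Saint-Ouen\n"
    "s3\tAnother laboratory\n"
)
rows = list(csv.DictReader(io.StringIO(metadata_tsv), delimiter="\t"))
cerba_strains = [row["strain"] for row in rows_containing(rows, "submitting_lab", "cerba")]
```

A case-insensitive substring match selects both spellings of the laboratory name, whereas an exact-match filter would require every variant to be listed explicitly.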

The authors also created a test dataset and updated the workflow documentation accordingly. However, the documentation and the test data disagree. The test archive in the repository is covflow_test_dataset.zip, which contains covflow_test_metadata.tsv and covflow_test_sequences.fasta, whereas the documentation states: "In the data directory, the compressed archive data_test.zip contains a fasta file (sequences.fasta) and tsv file (metadata.tsv)." I therefore recommend that the authors correct either the names of the test files or the documentation.

Page 8 line 166 - change Nextrain to Nextclade

Page 30 line 261 - “from raw sequence data to phylodynamics analyses.” This reads as though COVFlow performs raw sequence read analysis and genome assembly, which is not the case. Please correct it.

https://doi.org/10.24072/pci.genomics.100239.rev21

Evaluation round #1

DOI or URL of the preprint: https://doi.org/10.1101/2022.06.17.496544

Version of the preprint: 1

Author's Reply, 12 Jun 2023

https://doi.org/10.24072/pci.genomics.100239.ar1

Decision by Emmanuelle Lerat, posted 06 Mar 2023, validated 08 Mar 2023

Dear Dr Gonché,

I have now received the comments of two reviewers on your manuscript. They both find your work very interesting but point out some important issues that need to be addressed, especially the lack of a test dataset. Moreover, reviewer 1 had difficulty accessing the program.

Sincerely

Emmanuelle Lerat

https://doi.org/10.24072/pci.genomics.100239.d1

Reviewed by Bastien Boussau, 05 Mar 2023

Danesh et al. present a pipeline to select sequences according to a range of criteria from data downloaded from GISAID. It can be used to select sequences from a specific range of dates, from specific regions, from specific sequencing laboratories, or from specific viral lineages, by specifying the options in a YAML file. Further filtering options can also be specified in a JSON file. Once the data has been filtered, the pipeline can align the sequences, construct and date a phylogeny, and build the configuration file for a BEAST2 birth-death skyline plot analysis. The manuscript describes the pipeline, and presents an example analysis that has been performed with the pipeline.
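As a rough illustration of the metadata selection summarised above, the sketch below applies a date range and sets of allowed regions and lineages to metadata rows. The criteria keys and column names are hypothetical and do not reproduce COVFlow's YAML schema. Rows with incomplete dates are simply dropped here, which is a simplifying assumption.

```python
from datetime import date


def matches(row, criteria):
    """Return True if one metadata row satisfies every criterion."""
    try:
        sampled = date.fromisoformat(row["date"])
    except ValueError:
        # Incomplete dates such as "2021-07" cannot be placed in a range.
        return False
    if not (criteria["date_min"] <= sampled <= criteria["date_max"]):
        return False
    for field in ("region", "lineage"):
        allowed = criteria.get(field)
        if allowed and row[field] not in allowed:
            return False
    return True


# Hypothetical criteria, standing in for options read from a configuration file.
criteria = {
    "date_min": date(2021, 6, 1),
    "date_max": date(2021, 9, 30),
    "region": {"RegionA"},
    "lineage": {"B.1.617.2", "AY.4"},
}
rows = [
    {"strain": "s1", "date": "2021-07-15", "region": "RegionA", "lineage": "AY.4"},
    {"strain": "s2", "date": "2021-07", "region": "RegionA", "lineage": "AY.4"},
    {"strain": "s3", "date": "2021-08-01", "region": "RegionB", "lineage": "B.1.617.2"},
    {"strain": "s4", "date": "2021-11-01", "region": "RegionA", "lineage": "B.1.617.2"},
]
selected = [row["strain"] for row in rows if matches(row, criteria)]
```

Only `s1` is selected: `s2` has an incomplete date, `s3` is from another region, and `s4` falls outside the date range.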

The manuscript is clear and easy to read. The tool provided is likely to be useful, is easy to install (although I report below one typo), with very good documentation, but it is somewhat hard to test, because the authors have not provided a test data set. This is likely due to GISAID rules, which prevent distributing subsets of their data. However, the authors could build a mock data set, with a mock alignment (which can be very small), and mock metadata, on which their pipeline could run. This example might be useful to users who first want to try the tool on a small data set before they try it on their own.
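A mock data set of the kind suggested above could be generated along the following lines. This is only a sketch: the column names (`strain`, `date`, `region`), the region labels and the date window are placeholders rather than the fields COVFlow expects, and uniformly random sequences carry no phylogenetic signal, so such data would only exercise the mechanics of a pipeline.

```python
import random
from datetime import date, timedelta


def make_mock_dataset(n_sequences=5, length=60, seed=0):
    """Return (fasta_text, tsv_text) for a small, entirely synthetic data set."""
    rng = random.Random(seed)
    start = date(2021, 6, 1)  # arbitrary start of the mock sampling window
    fasta_lines = []
    tsv_lines = ["strain\tdate\tregion"]
    for i in range(n_sequences):
        name = f"mock/{i:03d}"
        # Equal-length random sequences stand in for a tiny alignment.
        sequence = "".join(rng.choice("ACGT") for _ in range(length))
        sampled = start + timedelta(days=rng.randrange(90))
        region = rng.choice(["RegionA", "RegionB"])
        fasta_lines += [f">{name}", sequence]
        tsv_lines.append(f"{name}\t{sampled.isoformat()}\t{region}")
    return "\n".join(fasta_lines) + "\n", "\n".join(tsv_lines) + "\n"
```

Fixing the seed makes the mock files reproducible, so that documentation and test data can stay in sync.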

Another recommendation I have would be to comment on the importance of the different priors users have to set for the parameters of the BDSKY analysis (in the minor comments below I point out that the "origin" parameter might be an important one). Some parameters may have a stronger influence than others, and their impact on the analysis may not be obvious to users. This information could be provided on their gitlab website.

Minor comments:
l22: "were made available" : have been made available
l23: "This allowed" : has allowed
l31: "go from dates virus sequence data": dated
l122: "This yields infectious periods varying from 1.2 to 36.5 days": perhaps specify the mean, and clarify a bit because the parameter was specified a couple of lines above as per year, whereas this sentence is in days, which can be a bit confusing.
l128: "The default prior for this parameter prior is a uniform distribution Uniform(0, 2) years.": this prior seems a bit dangerous for naive users who may be using the method in the future. If they don't change it, it seems like they would not be able to infer origin dates older than 2 years from the date of their analysis.
Fig. 2 legend: "In panel c, the lined show" : lines
l155: "allow us to visualise": allows us
l161: "variant epidemic seems to occurred earlier and more frequently": have occurred
l168: "PACA experience a period of Delta variant growth": experienced
l180: "from the GISAID datased" : dataset
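The unit mismatch raised in the comment on l122 can be made concrete with a short conversion. The sketch assumes that the infectious period is the reciprocal of a rate expressed per year and that a year has 365 days. The example rates of 10 and 300 per year are chosen only because they approximately reproduce the quoted range of 1.2 to 36.5 days; they are not taken from the manuscript.

```python
DAYS_PER_YEAR = 365.0  # assumption: non-leap year


def infectious_period_days(rate_per_year):
    """Convert a rate in events per year into a mean duration in days."""
    if rate_per_year <= 0:
        raise ValueError("rate must be positive")
    return DAYS_PER_YEAR / rate_per_year


# Hypothetical bounds on the rate, in units of 1/year.
longest = infectious_period_days(10)    # 36.5 days
shortest = infectious_period_days(300)  # about 1.2 days
```

Reporting the bounds in both units side by side, as here, avoids the confusion between per-year rates and durations in days.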

Software test:
cd cov-flow : cd Cov-flow

https://doi.org/10.24072/pci.genomics.100239.rev11

Reviewed by Gabriel Wallau, 30 Jan 2023

Danesh and collaborators present COVFlow, a computational pipeline that performs sample selection and phylodynamic analysis of SARS-CoV-2 sequences. Given the huge number of SARS-CoV-2 sequences available in public databases, such pipelines are much needed to select datasets that are amenable to computational analysis and inference. COVFlow therefore addresses an important bottleneck in the field of genomic surveillance, particularly regarding the inference of virus transmission rates, which is key information for public health decision making. However, from the application point of view, this pipeline performs steps similar to those already performed by other widely used software (i.e. Nextclade). In addition, I could not test the pipeline due to a user permission restriction. In summary, I suggest a number of modifications and clarifications in the manuscript so that it can be reassessed in more detail.

Comments and requests

Page 3 - line 31 - I suggest changing “dates virus sequence data” to “date-stamped virus sequence data.”

Page 3 - lines 40-41. What do the authors mean by “However, these do not include a data filtration step based on metadata characteristics.”? The nextstrain CLI tool, which includes Augur in some steps, allows the user to filter data based on different metadata (see https://docs.nextstrain.org/projects/ncov/en/latest/guides/workflow-config-file.html), such as: collection date, pangolin lineage, genome length, host, geographic information (region, country, division, location). I suggest the authors clarify which metadata COVFlow can filter on that nextclade cannot. Moreover, I recommend that the authors describe the advantages of each step of COVFlow (filtering, alignment, masking sites and tree building) when compared with nextstrain (https://docs.nextstrain.org/en/latest/learn/parts.html). From my point of view, there are two new COVFlow features compared with the nextstrain CLI: subsampling appears to be proportional in the models (instead of an absolute number per sampling group that can be set in nextclade), and an XML file is generated for use in beast2.
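The distinction drawn above between proportional subsampling and a fixed number per group can be sketched as follows. Neither function reproduces the actual behaviour of COVFlow or of the Nextstrain tools; the function names, the group labels and the rounding rule are assumptions made for illustration.

```python
import random


def subsample_proportional(groups, fraction, seed=0):
    """Draw the same fraction of items from every group."""
    rng = random.Random(seed)
    picked = {}
    for name, items in groups.items():
        k = min(len(items), round(fraction * len(items)))
        picked[name] = rng.sample(items, k)
    return picked


def subsample_absolute(groups, per_group, seed=0):
    """Draw at most `per_group` items from every group."""
    rng = random.Random(seed)
    return {
        name: rng.sample(items, min(per_group, len(items)))
        for name, items in groups.items()
    }


# Hypothetical sampling groups of very different sizes.
groups = {
    "RegionA": [f"a{i}" for i in range(100)],
    "RegionB": [f"b{i}" for i in range(10)],
}
proportional = subsample_proportional(groups, 0.1)
absolute = subsample_absolute(groups, 5)
```

With these toy groups, proportional subsampling keeps 10 and 1 sequences respectively, preserving the relative sizes of the groups, whereas the absolute scheme keeps 5 from each and so over-represents the smaller group.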

Page 6 - line 96 - Why did the authors use the option “and ’addfragments’” if the sequences are almost all full length? Maybe the --add option is enough. Please clarify.

Page 7-9 - lines 137-152 - Regarding the selection of samples for BDSKY analysis, one key step is the selection of monophyletic clades, followed by separate Re estimates on each of them. Otherwise, Re estimates could be strongly biased by inferring transmission timing dynamics from unstable “clades”, which means that every run of the pipeline may generate a different time-tree structure and reach different Re estimates. Did the authors include such a step in the pipeline? Please clarify.

Moreover, figure 2 lower section should be depicted with case numbers from each region and the whole country to evaluate if the Re estimates are compatible with the epidemiological curves. I suggest three different plots, one for each region considered.

At the moment of Delta variant spread the population had already a complex mix of acquired and vaccine induced immunity. It would be interesting to add the vaccination rate from each region through time in this figure as well.

Page 9 - lines 168-171. Are there any other available genomic data that could provide some additional lines of evidence of a Delta growth at this time point besides inference tests? Proportion of genomic defined lineages? One suggestion is to plot the lineage GISAID data itself from each region and France alongside Figure 2C.

GitLab issues

Following the GitLab instructions for installing COVFlow, the git clone step returns: fatal: Could not read from remote repository.

The authors should clarify how to obtain the tsv metadata file. Can it be obtained from the general metadata in the Download section of GISAID - EpiCoV, or does it come from the metadata available after a sequence selection performed in the search interface of GISAID - EpiCoV? If the metadata file has more columns than the ones specified in Metadata Fields, would COVFlow still work?

I suggest that the authors create a test dataset with fasta and metadata files, or describe a way for the user to retrieve one from GISAID, together with a step-by-step guide that the user could follow to perform a test analysis with the current json files present in the examples directory. This will make it easier for users to try COVFlow through simple testing.

https://doi.org/10.24072/pci.genomics.100239.rev12
