lenaschimmel / sc2rf

SARS-Cov-2 Recombinant Finder for fasta sequences

License: MIT License

Python 100.00%

sars-cov-2 mutations recombinants covid genetic

sc2rf's Introduction

Sc2rf - SARS-Cov-2 Recombinant Finder

Pronounced: Scarf

What's this?

Sc2rf can search genome sequences of SARS-CoV-2 for potential recombinants - new virus lineages that have (partial) genes from more than one parent lineage.

Is it already usable?

This is a very young project, started on March 5th, 2022. As such, proceed with care. Results may be wrong or misleading, and with every update, anything can still change a lot.

Anyway, I'm happy that scientists are already seeing benefits from Sc2rf and using it to prepare lineage proposals for cov-lineages/pango-designation.

Though I already have a lot of ideas and plans for Sc2rf (see the bottom of this document), I'm very open to suggestions and feature requests. Please write an issue, start a discussion or get in touch via mail or twitter!

Example output

Requirements and Installation

You need at least Python 3.6 and you need to install the requirements first. You might use something like python3 -m pip install -r requirements.txt to do that. There's a setup.py which you should probably ignore, since it's work in progress and does not work as intended yet.

Also, you need a terminal which supports ANSI control sequences to display colored text. On Linux, MacOS, etc. it should probably work.

On Windows, color support is tricky. On a recent version of Windows 10, it should work, but if it doesn't, install Windows Terminal from GitHub or Microsoft Store and run it from there.

Basic Usage

Start with a .fasta file with one or more sequences which might contain recombinants. Your sequences have to be aligned to the reference.fasta. If they are not, you will get an error message like:

Sequence hCoV-19/Phantasialand/EFWEFWD not properly aligned, length is 29718 instead of 29903.

(For historical reasons, I always used Nextclade to get aligned sequences, but you might also use Nextalign or any other tool. Installing them is easy on Linux or MacOS, but not on Windows. You can also use a web-based tool like MAFFT.)
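Before running Sc2rf on a large file, it can help to check the alignment yourself. The following is a minimal sketch, not code from Sc2rf: the helper names `read_fasta_records` and `misaligned` are made up for illustration, and the expected length of 29903 is taken from the error message above.

```python
REFERENCE_LENGTH = 29903  # length quoted in the error message above


def read_fasta_records(lines):
    """Yield (name, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def misaligned(lines, expected=REFERENCE_LENGTH):
    """Return (name, length) for every record whose length differs from `expected`."""
    return [(name, len(seq))
            for name, seq in read_fasta_records(lines)
            if len(seq) != expected]
```

Passing an open file handle as `lines` works, because iterating over a text file yields its lines.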

Then call:

sc2rf.py <your_filename.fasta>

If you just need some fasta files for testing, you can search the pango-lineage proposals for recombinant issues with fasta-files, or take some files from my shared-sequences repository, which might not contain any actual recombinants, but hundreds of sequences that look like they were!

No output / some sequences not shown

By default, a lot of filters are active to show only the likely recombinants, so that you can input 10000s of sequences and just get output for the interesting ones. If you want, you can disable all filters like this, which is only recommended for small input files with fewer than 100 sequences:

sc2rf.py --parents 1-35 --breakpoints 0-100 \
--unique 1 --max-ambiguous 10000 <your_filename.fasta>

or even

sc2rf.py --parents 1-35 --breakpoints 0-100 \
--unique 1 --max-ambiguous 10000 --force-all-parents \
--clades all <your_filename.fasta>

The meaning of these parameters is described below.

Advanced Usage

You can execute sc2rf.py -h to get exactly this help message:

usage: sc2rf.py [-h] [--primers [PRIMER ...]]
                [--primer-intervals [INTERVAL ...]]
                [--parents INTERVAL] [--breakpoints INTERVAL]
                [--clades [CLADES ...]] [--unique NUM]
                [--max-intermission-length NUM]
                [--max-intermission-count NUM]
                [--max-name-length NUM] [--max-ambiguous NUM]
                [--force-all-parents]
                [--select-sequences INTERVAL]
                [--enable-deletions] [--show-private-mutations]
                [--rebuild-examples] [--mutation-threshold NUM]
                [--add-spaces [NUM]] [--sort-by-id [NUM]]
                [--verbose] [--ansi] [--hide-progress]
                [--csvfile CSVFILE]
                [input ...]

Analyse SARS-CoV-2 sequences for potential, unknown recombinant
variants.

positional arguments:
  input                 input sequence(s) to test, as aligned
                        .fasta file(s) (default: None)

optional arguments:
  -h, --help            show this help message and exit

  --primers [PRIMER ...]
                        Filenames of primer set(s) to visualize.
                        The .bed formats for ARTIC and EasySeq
                        are recognized and supported. (default:
                        None)

  --primer-intervals [INTERVAL ...]
                        Coordinate intervals in which to
                        visualize primers. (default: None)

  --parents INTERVAL, -p INTERVAL
                        Allowed number of potential parents of a
                        recombinant. (default: 2-4)

  --breakpoints INTERVAL, -b INTERVAL
                        Allowed number of breakpoints in a
                        recombinant. (default: 1-4)

  --clades [CLADES ...], -c [CLADES ...]
                        List of variants which are considered as
                        potential parents. Use Nextstrain clades
                        (like "21B"), or Pango Lineages (like
                        "B.1.617.1") or both. Also accepts "all".
                        (default: ['20I', '20H', '20J', '21I',
                        '21J', 'BA.1', 'BA.2', 'BA.3'])

  --unique NUM, -u NUM  Minimum of substitutions in a sample
                        which are unique to a potential parent
                        clade, so that the clade will be
                        considered. (default: 2)

  --max-intermission-length NUM, -l NUM
                        The maximum length of an intermission in
                        consecutive substitutions. Intermissions
                        are stretches to be ignored when counting
                        breakpoints. (default: 2)

  --max-intermission-count NUM, -i NUM
                        The maximum number of intermissions which
                        will be ignored. Surplus intermissions
                        count towards the number of breakpoints.
                        (default: 8)

  --max-name-length NUM, -n NUM
                        Only show up to NUM characters of sample
                        names. (default: 30)

  --max-ambiguous NUM, -a NUM
                        Maximum number of ambiguous nucs in a
                        sample before it gets ignored. (default:
                        50)

  --force-all-parents, -f
                        Force to consider all clades as potential
                        parents for all sequences. Only useful
                        for debugging.

  --select-sequences INTERVAL, -s INTERVAL
                        Use only a specific range of input
                        sequences. DOES NOT YET WORK WITH
                        MULTIPLE INPUT FILES. (default: 0-999999)

  --enable-deletions, -d
                        Include deletions in lineage comparision.

  --show-private-mutations
                        Display mutations which are not in any of
                        the potential parental clades.

  --rebuild-examples, -r
                        Rebuild the mutations in examples by
                        querying cov-spectrum.org.

  --mutation-threshold NUM, -t NUM
                        Consider mutations with a prevalence of
                        at least NUM as mandatory for a clade
                        (range 0.05 - 1.0, default: 0.75).

  --add-spaces [NUM]    Add spaces between every N colums, which
                        makes it easier to keep your eye at a
                        fixed place. (default without flag: 0,
                        default with flag: 5)

  --sort-by-id [NUM]    Sort the input sequences by the ID. If
                        you provide NUM, only the first NUM
                        characters are considered. Useful if this
                        correlates with meaning full meta
                        information, e.g. the sequencing lab.
                        (default without flag: 0, default with
                        flag: 999)

  --verbose, -v         Print some more information, mostly
                        useful for debugging.

  --ansi                Use only ASCII characters to be
                        compatible with ansilove.

  --hide-progress       Don't show progress bars during long
                        task.

  --csvfile CSVFILE     Path to write results in CSV format.
                        (default: None)

An Interval can be a single number ("3"), a closed interval
("2-5" ) or an open one ("4-" or "-7"). The limits are inclusive.
Only positive numbers are supported.
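The interval syntax described in the help message could be parsed roughly like this. This is a sketch under assumptions, not the parser used by Sc2rf: the function names are hypothetical, and treating a missing lower limit as 0 and a missing upper limit as infinity is my own choice.

```python
import math


def parse_interval(text):
    """Parse "3", "2-5", "4-" or "-7" into an inclusive (low, high) pair."""
    text = text.strip()
    if "-" not in text:
        value = int(text)
        return value, value
    low, high = text.split("-", 1)
    # Assumption: an open end means "no limit" on that side.
    return (int(low) if low else 0, int(high) if high else math.inf)


def in_interval(value, interval):
    """Check membership; both limits are inclusive."""
    low, high = interval
    return low <= value <= high
```

For example, `in_interval(4, parse_interval("1-4"))` is true, which matches the statement that the limits are inclusive.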

Interpreting the output

To be written...

There already is a short Twitter thread which explains the basics.

Source material attribution

virus_properties.json contains data from LAPIS / cov-spectrum which uses data from NCBI GenBank, prepared and hosted by Nextstrain, see blog post.
reference.fasta is taken from Nextstrain's nextclade_data, see NCBI for attribution.
mapping.csv is a modified version of the table on the covariants homepage by Nextstrain.
Example output / screenshot based on Sequences published by the German Robert-Koch-Institut.
Primers:
- ARTIC primers CC-BY-4.0 by the ARTICnetwork project
- ~~EasySeq primers by Coolen, J. P., Wolters, F., Tostmann, A., van Groningen, L. F., Bleeker-Rovers, C. P., Tan, E. C., ... & Melchers, W. J.~~ Removed until I understand the format of the .bed file. There will be an issue soon.
- midnight primers CC-BY-4.0 by Silander, Olin K, Massey University

The initial version of this program was written in cooperation with @flauschzelle.

TODO / IDEAS / PLANS

sc2rf's Issues

Write documentation on how to interpret the output

Later, there should be a proper tutorial / walkthrough. Like the whirlwind-tour in the documentation of xsv.

For the moment, I think the parameters are documented sufficiently, but the output is not.

This applies to the main output, but even more so to the new primer visualizations.

Option to ignore shared substitutions

I've been experimenting with a flag --ignore-shared that ignores positions that are shared (have the exact same nucleotide) across all parents/examples.
I like this option because it makes the breakpoints visually clearer, as there's a direct color change (red -> green) rather than having the intermediate shared positions (red -> white -> green)
For testing, a nextclade fasta alignment of XM-like recombinants (public on genbank): XM.txt

Do you think this is scientifically sound for reporting? And if so,
Would you be interested in a PR if I tidy up the code?

Default Output:

python3 sc2rf.py XM.fasta --ansi --unique 1

Proposed Option:

python3 sc2rf.py XM.fasta --ansi --unique 1 --ignore-shared
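The idea behind the proposed --ignore-shared flag can be sketched as follows. This is an illustration only, not the code from the issue author's branch: the function name, the data layout (one dict of position to nucleotide per parent) and the example positions are assumptions.

```python
def informative_positions(parents):
    """Return sorted positions where the parents do not all carry the same nucleotide.

    `parents` maps a parent name to a dict of {position: nucleotide}.
    A position missing from a parent's dict counts as "reference" (None here).
    """
    positions = set()
    for mutations in parents.values():
        positions.update(mutations)
    keep = []
    for pos in sorted(positions):
        bases = {mutations.get(pos) for mutations in parents.values()}
        if len(bases) > 1:  # at least two parents differ here
            keep.append(pos)
    return keep
```

Positions dropped by this filter are the ones that would be shown in the shared color between the two parent colors.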

--csvfile option does not work

Hey!

First of all, great tool to find the potential recombinants. Made my life easy.
I needed to parse the output of sc2rf only to get the potential recombinant sequences and their breakpoints. I see the --csvfile option in the README, but it does not seem to be included in the sc2rf Python executable. I get this error:

sc2rf.py: error: unrecognized arguments: --csvfile output.csv

Any idea if I could get the output in the way I need?

GISAID XT recombinant not detected by sc2rf

Hi, I've noticed that sc2rf.py (version sc2rf-7427d2f94b69c965362034c2597b643c5dfaa1cf) could not find any recombination for XT samples available on GISAID when running python sc2rf.py nextclade.aligned_XT_Gisaid.fasta. Here are the available aligned sequences.
nextclade.aligned_XT_Gisaid.txt

Nextclade:

sc2rf:

Thanks for looking into this and other lineages that might be in the same situation.

ENH: provide output optionally as csv/tsv for automated analysis/sharing

Right now the output is good for interactive human analysis, but there's a lack of csv/tsv machine readable output for sharing/further analysis.

From my experience with Nextclade, main difficulty here is the design of the specs of the file, which columns to include etc, which separators to take if you need an intra-column separator etc.

Maybe best to discuss on this issue before implementing something as one will kind of get locked in to the format.

Bridging the gap between sc2rf result and Pangolin X* lineages

First, thanks to the authors for bringing the useful tool for us.

We have been using sc2rf to scan for recombinant sequences and determine breakpoints, but I found that there is a gap between its results and the Pangolin X* lineage calls. I was wondering whether it is possible to bridge the gap by: 1. taking in the lineage designations for Pangolin X* lineages, then scanning and storing the profile of each recombinant lineage; 2. for a new query sequence, if the breakpoint profile matches an existing Pangolin X* lineage, not just suggesting the parent lineages and breakpoints in the result, but providing a possible X* lineage call as well. More or less in the way the Scorpio Constellation works.

I expect this would be a more accurate way of assigning recombinant lineages than the current UShER calls, where the breakpoint positions may not match.

Thanks for considering the suggestion.

Find and use better source for typical mutations of lineages

See this comment by @AngieHinrichs which even contains an alternative.

Thanks a lot for your detailed explanation! I'm trying to move this over here so it's easier to find for me.

(Also, if the comment thread over at pango-designation gets locked down after too many "off topic" comments, I won't be able to comment there at all. Already happened in other issues.)

ENH: add spaces between genes rather than every 5 mutations

Alternative spacing between genes would be great so one can remember gene boundaries

ENH: Accept MAPLE file as alternative to fasta

I think it could speed up the algorithm quite a bit if you accepted sequences in MAPLE format rather than fasta.

MAPLE contains basically all the info you want, all the mutations. So there'd be no need to recompute.

I can see if we can produce MAPLE files from Nextclade. It would make sense to produce a human readable complete output.

It's probably a bit early to implement this before there are good tools to produce maple files - but I think it's neat that this format would speed up sc2rf a lot!

https://www.biorxiv.org/content/10.1101/2022.03.22.485312v1.full

Make tool pip-installable

Shouldn't be difficult, you need a setup.py and an account on PyPI.

You can have a look at this repo of mine that can be installed via PyPI as a command line tool (if you install it, the command becomes automatically available in Path!)
https://github.com/corneliusroemer/fasta_zstd_sqlite/blob/master/setup.py

ENH: Allow tool to run in a web browser

It'd be super cool if the tool ran in a web browser.

Drop (aligned) sequences and see it displayed.

This may not be worth the probably considerable effort, though.

It would reduce barrier to entry significantly.

BUG: Problem using covSpectrum mutation share - Ns are treated as reference

There's a bit of a problem with using covSpectrum's current mutation API implementation: Ns in any sample are treated as reference.

This can cause confusion. For example, I thought that this intermission here within Spike was a bad sign:

cov-lineages/pango-designation#498

But it isn't! Both 22813 and 22882 are defining for both BA.1 and BA.2. However, both are apparently N in 40% of sequences in BA.1. This causes sc2rf to conclude that they are in fact not defining mutations in BA.1, which makes spurious intermissions appear.

I'm not sure how to work around this best. Really, this should be fixed in covSpectrum: Ns should be left out of mutation proportion calculations - and not be treated as reference (implicitly).

@chaoran-chen can you think of a workaround? How can one get the share of Ns for a query? Could that maybe be supplied by a new API endpoint?

Usually, Ns don't make up 40% of a site, but sometimes they do and that can cause problems like here, where one falsely thinks there's a non-clean breakpoint.
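To illustrate the effect with made-up numbers: suppose 600 of 1000 sequences carry the mutation, none carry the reference base, and 400 have an N at that position. The sketch below compares the two ways of computing the share; the function name and the counts are hypothetical and this is not the covSpectrum API.

```python
def mutation_share(mutated, reference, missing, exclude_missing=True):
    """Share of sequences carrying a mutation at one position.

    With exclude_missing=False, Ns are implicitly counted as reference,
    which is the behaviour described in this issue.
    """
    called = mutated + reference
    denominator = called if exclude_missing else called + missing
    if denominator == 0:
        return None  # no usable information at this position
    return mutated / denominator


# Ns counted as reference: 600 / 1000 = 0.6, below the default threshold of 0.75
share_with_ns = mutation_share(600, 0, 400, exclude_missing=False)
# Ns left out: 600 / 600 = 1.0, clearly a defining mutation
share_without_ns = mutation_share(600, 0, 400, exclude_missing=True)
```

With the default --mutation-threshold of 0.75, the first value would drop the mutation from the clade definition while the second would keep it.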

Deltacrons with NSP3 breakpoint

Thank you for developing this incredible tool! I'm testing it out by reproducing some of the recombinant clades described in pango-designation. So far it's going really well, but I'm struggling with detecting/interpreting Deltacrons with an NSP3 breakpoint. Some context:

So far, BA.1/BA.2 recombination is straightforward to detect and interpret (regardless of breakpoint) (ex. cov-lineages/pango-designation#448)

python3 search_recombinants.py pango-designation_issue_448.fasta

With some parameter tweaking, I can also recover Deltacrons when the breakpoint involves the S gene (ex. cov-lineages/pango-designation#444). So many intermissions though!

python3 search_recombinants.py pango-designation_issue_444.fasta \
  --parents 2-4 \
  --breakpoints 1-4 \
  --max-intermission-count 50 \
  --max-intermission-length 3

But I can't reliably detect Deltacrons that do not involve the S gene, such as at NSP3. (ex. cov-lineages/pango-designation#446).

python3 search_recombinants.py pango-designation_issue_446.fasta \
  --parents 2-4 \
  --breakpoints 0-10 \
  --unique 1

Part 1:

Part 2:

RIPPLES also struggles with these. It identifies the 21I & 21K samples as recombinants (Part 1), and not the 21I & 21J & 21K (Part 2).

My interpretation is that these are "Potential Deltacrons", but there is insufficient nucleotide information to resolve it one way or another. Would you agree with this classification/interpretation?

Q: Why show all donors not just the relevant ones?

I'm analyzing one sequence and am wondering why you output all potential donors/parents, not just the two that seem most relevant here: BA.1/21J?

Are my arguments wrong? When I reduce parents to 0-5, I get no output, which is weird. Don't quite understand what's going on here.

Have a look at the Δm, n, 2 statistic for breakpoint detection

I was contacted by @maciekboni who recommended the Δm, n, 2 statistic. It looks very relevant for the case of two potential parent lineages.

I have not yet found the time to take a closer look and make up my mind if/how it fits with my mid-term plans of adjusting my algorithms, especially to include the individual mutation prevalences (#15) into the calculation.

Crash related to tqdm

Originally posted by @Vjimenez-vasquez in #25 (comment):

Hi there,

I ran the following command :

python3 sc2rf.py test2.fasta --unique 1

And got the following message :

Traceback (most recent call last):
  File "sc2rf.py", line 987, in <module>
    main()
  File "sc2rf.py", line 132, in main
    reference = read_fasta('reference.fasta', None)['MN908947 (Wuhan-Hu-1/2019)']
  File "sc2rf.py", line 476, in read_fasta
    with my_tqdm(total=os.stat(path).st_size, desc="Read " + path, unit_scale=True) as pbar:
  File "sc2rf.py", line 199, in my_tqdm
    return tqdm(*margs, delay=0.1, colour="green", disable=bool(args.hide_progress), **kwargs)
  File "/home/hp/anaconda3/lib/python3.7/site-packages/tqdm/std.py", line 922, in __init__
    TqdmKeyError("Unknown argument(s): " + str(kwargs)))
tqdm.std.TqdmKeyError: "Unknown argument(s): {'delay': 0.1, 'colour': 'green'}"

Do you have any suggestions, please?

TypeError: unsupported operand type(s) for |: 'dict' and 'dict'

Getting this error while trying to run the program:

Reading reference genome, lineage definitions...
Done.
Reading actual input.
Traceback (most recent call last):
  File "search_recombinants.py", line 539, in <module>
    main()
  File "search_recombinants.py", line 96, in main
    all_samples = all_samples | read_samples
TypeError: unsupported operand type(s) for |: 'dict' and 'dict'
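The | operator for dicts was added in Python 3.9, so this line fails on older interpreters. A version-independent way to merge two dicts could look like the sketch below; `merge_dicts` is a hypothetical helper, not a function from the script.

```python
def merge_dicts(first, second):
    """Return a new dict; on duplicate keys the value from `second` wins.

    Behaves like `first | second` on Python 3.9+, but also runs on older versions.
    """
    merged = dict(first)
    merged.update(second)
    return merged
```

Neither input is modified, so the call can replace the `all_samples | read_samples` expression directly.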

ENH: Sort by breakpoint

It would really help spot clusters if one could sort by breakpoint. sc2rf knows the rough location of the breakpoint, so it should be possible to sort the output.

If there are multiple breakpoints, sort first by the first one, then by the second.
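A minimal sketch of that ordering, assuming the breakpoints of each sample are available as a list of positions (the data layout and the function name are assumptions, not sc2rf internals):

```python
def sort_by_breakpoints(samples):
    """Sort (name, breakpoints) pairs by first breakpoint, then second, and so on.

    `samples` maps a sample name to a list of breakpoint positions.
    Python compares lists element by element, which gives exactly this order.
    """
    return sorted(samples.items(), key=lambda item: sorted(item[1]))
```

Samples without any breakpoint sort first, because an empty list compares smaller than any non-empty one.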

Terminal Ns not recognized as missing

While investigating cov-lineages/pango-designation#590, I noticed that samples with the BA.2 S2M deletion (29734:29759) were being incorrectly visualized as having reference bases in sc2rf:

Consensus View:

sc2rf View:

I think this could be for a couple of reasons:

When --enable-deletions is used, perhaps deletions should not be considered missing data?

missings_matches = ["N"]
if not args.enable_deletions:
    missings_matches.append("-")

I think there is missing logic when detecting a run of Ns, to catch whether that run proceeds to the end of the genome?

if s in missings_matches:
    if start_n == -1:
        start_n = i  # mark the start of possible run of N's
elif start_n >= 0:
    # we've been tracking a run of N's, this base marks the end
    missings.append((start_n, i-1))  # Python-style (closed, open) interval
    start_n = -1

# Missing logic to catch missing data at the end of the genome?
if i == len(reference) and s in missings_matches:
    missings.append((start_n, i-1))

With these changes, the sc2rf output more closely matches the consensus sequence/my expectation:

I think this is a bug, but if it's the intended behaviour for deletions, please let me know. Thanks!
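As a self-contained illustration of the run detection discussed above, here is a sketch that also closes a run reaching the end of the sequence. It is not the sc2rf implementation: the function name is hypothetical, and it returns inclusive (start, end) index pairs, which is my own choice.

```python
def find_missing_runs(sequence, missing_chars=("N",)):
    """Return inclusive (start, end) index pairs for each run of missing bases."""
    runs = []
    start = -1
    for i, base in enumerate(sequence):
        if base in missing_chars:
            if start == -1:
                start = i  # a new run begins here
        elif start != -1:
            runs.append((start, i - 1))  # the previous base ended the run
            start = -1
    if start != -1:
        # the run extends to the last base, so close it after the loop
        runs.append((start, len(sequence) - 1))
    return runs
```

Passing `missing_chars=("N", "-")` treats deletions as missing data, while the default treats only N as missing, mirroring the choice made via --enable-deletions in the snippet above.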

Allow more file formats and/or access methods, i.e. Auspice v2 dataset JSON from nextstrain URLs

It seems that Auspice v2 dataset JSON has become a de facto standard way to link to a set of samples, like in nextstrain's fetch URLs.

If I remember it correctly, that JSON can easily be traversed to get the full set of mutations for each sample. I would like to accept those URLs (either the part after nextstrain.org/fetch/ or the whole URL) as an alternative to local .fasta filenames.

Or I should accept both file formats (.fasta and Auspice v2 dataset JSON) as well as several access methods (local file name, remote URL or piped stream), in any possible combination. I might need something like a rewindable stream so that I can look at the first few bytes, decide what it is, and then parse it from the beginning.
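One way to look at the first few bytes without consuming them is `io.BufferedReader.peek`. The sketch below is only an idea for the format detection step, under the assumption that FASTA input starts with ">" and JSON input starts with "{"; `sniff_format` is a hypothetical name.

```python
import io


def sniff_format(stream):
    """Guess "fasta" or "json" from the first non-whitespace byte.

    Returns (kind, buffered_stream). Nothing is consumed, so the returned
    stream can still be parsed from the beginning.
    """
    if isinstance(stream, io.BufferedReader):
        buffered = stream
    else:
        buffered = io.BufferedReader(stream)
    head = buffered.peek(64).lstrip()
    if head.startswith(b">"):
        kind = "fasta"
    elif head.startswith(b"{"):
        kind = "json"
    else:
        kind = "unknown"
    return kind, buffered
```

Files opened with `open(path, "rb")` and `sys.stdin.buffer` are already buffered readers, so local files and piped input could go through the same function.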

How to interpret BA.4/BA.5 list of mutations

Hi!
How should mutations for BA.4/BA.5 be interpreted?
https://github.com/lenaschimmel/sc2rf/blob/main/virus_properties.json#L9057
Thanks,
Javier

ENH: show progress bar, say how many files were read in, how processing is going

Would be nice to see how things are going

tqdm makes this very easy with python

a bit more logging while the analysis is going would be cool too, just so that I know what's going on, instead of seeing nothing for a minute

Add relevant primer bed files and make sure the bed format is correctly interpreted

To make it easier to distinguish real recombinants and sequencing artifacts, I want to add the most relevant primers to Sc2rf. Currently, this involves adding the .bed files to the primers directory and making sure the format is correctly recognized.

Currently available and looking good

ARTIC v3, v4, v4.1
Midnight

Currently not usable (reason see below)

EasySeq

Others

I'm not sure what other primers are relevant, and how to find out.

see https://primer-monitor.neb.com/primer_sets
I'm currently awaiting feedback from RKI about primers commonly used in Germany
there was another list/overview page in a Twitter group, but I don't remember where it is :/

Misunderstanding regarding EasySeq

I once added EasySeq, but I removed it again, because I realized that I probably misinterpreted the .bed file.

With ARTIC and Midnight, the length of the actual primer and the distance of the coordinates match, e.g. from artic_v4_1.bed:

MN908947.3 324 344 SARS-CoV-2_2_LEFT 2 + TTTACAGGTTCGCGACGTGC

len('TTTACAGGTTCGCGACGTGC') = 20 = 344 - 324
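This consistency check can be automated for a whole .bed file. The sketch below assumes the whitespace-separated layout of the line quoted above, with start and end in the second and third column and the primer sequence in the last one; the function name is hypothetical.

```python
def primer_length_matches(bed_line):
    """Check that the primer sequence is as long as the coordinate span."""
    fields = bed_line.split()
    start, end, sequence = int(fields[1]), int(fields[2]), fields[-1]
    return len(sequence) == end - start


line = "MN908947.3 324 344 SARS-CoV-2_2_LEFT 2 + TTTACAGGTTCGCGACGTGC"
matches = primer_length_matches(line)
```

A file in which this check fails for most lines is probably using the coordinates for something other than the primer itself.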

But with this EasySeq file, it does not. The primer sequences are not included in this file, but in this other bed file.

In the second file, you can see that all primers have the same length (18 nucs) but the start and end coordinates are those of the full amplicon:

Name	start insert	end insert	Left Primer	Right Primer
Ampl_001_pA	55	263	CAACTTTCGATCTCTTGT	GGACAAGGCTCTCCATCT
Ampl_002_pB	256	483	TTTMGTCCGGGTGTGACC	GCAGTTCGAGCATCCGAA
Ampl_003_pA	468	652	CTCAACTTGAACAGCCCT	ACTATGGCCACCAGCTCC

In the first file, the coordinates end just before the amplicons:

NC_045512.2	30	54	NC_045512.2	263	286	nCoV-2019_1
NC_045512.2	232	255	NC_045512.2	483	506	nCoV-2019_2
NC_045512.2	444	467	NC_045512.2	652	675	nCoV-2019_3
NC_045512.2	594	617	NC_045512.2	841	865	nCoV-2019_4

But have a varying length, bigger than 18:

54 - 30 = 24
255 - 232 = 23
467 - 444 = 23
617 - 594 = 23

I'm not sure what these longer intervals mean, and whether it is correct or useful to highlight them in Sc2rf. If I add these files and interpret them as I did with the other primers, it looks like this:

Here, for example, position 22200 is highlighted with « which indicates that the left primer for amplicon 117 is affected for all three versions of EasySeq, because the first bed file has the range 22197 to 22222. According to the other one, the amplicon starts at 22223 so I would expect the primer to start at 22223 - 18 = 22205, so that it is unaffected by the mutation at 22200.

@JordyCoolen can you help me understand this?

Release a version

I know this tool is still in development, but I'm looking forward to your first working version.

I want to create a docker container for sarscov2recombinants for the state public health bioinformaticians (https://github.com/StaPH-B/docker-builds).

Issues with VT / ANSI color codes, especially on Windows/Ubuntu

A lot of sub-issues:

hard to get it working on Windows
- recommend Windows Terminal instead
- try the os.system('') hack found on Stack Overflow
color bar of the last genome region (after N) extends to the end of line
sequence names are displayed black on black, thus invisible
unimportant stuff has much too much contrast, at least on Ubuntu (bright white)
need more unique colors when using --force-all-parents together with --clades all (at least 17 instead of currently 6)
Default colors for 2-parent-recombinants are red and green, making it unusable for users with red–green color blindness (up to 8% of users with a single X chromosome)

Bug/question with --force-all-parents --clades all

Hi there,

I just was wondering why I have no output and tried the second example from here:
https://github.com/lenaschimmel/sc2rf#no-output--some-sequences-not-shown

So I added --clades all --force-all-parents to my call, but it seems that they can't be used together:

The number of allowed parents, the number of selected clades, and the --force-all-parents conflict so that the results must be empty.

Also, --clades all can't be used as the last argument (before the input) because the input won't be recognized

Input sequences must be provided, except when rebuilding the examples. Use --help for more info. Program exits.

I'm not sure if this is only my setup/input problem.

Would you suggest using -c all or -f? My full command is

  python3 sc2rf.py --csvfile ../${name}_sc2rf.csv --parents 1-35 --breakpoints 1-2 \
                      --max-intermission-count 3 --max-intermission-length 1 \
                      --unique 1 --max-ambiguous 10000 --max-name-length 55 \
                      ### --clades all  --force-all-parents  \ ###
                      ../${fasta}

Best
Marie

ENH: Output full internal representation of analysis result for sharing without need for recomputation

The analysis is one-off at the moment: in order to see the result, I have to rerun the whole analysis.

That's fine if analysing only 1000 samples or so, but if one were to run it on all of GISAID or even only 50k samples, the waiting would be very inconvenient.

It would be nice if the act of analysis and the act of viewing were independent.

So one could run the analysis on a server, download the results and view locally.

All that's required, I think, would be to output the internal representation of whatever you use to create the terminal output. One could start off with that simply being a pickled python object.

Or alternatively, turn it into JSON to make the output human readable and also usable for machine analysis.

As a result, one could run sc2rf with --output sc2rf_analysis.json to get the analysis result. Share that file, and view it with --precomputed sc2rf_analysis.json.
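The JSON round trip suggested here could be as small as the sketch below. The flags --output and --precomputed are proposals from this issue, and the function names and the shape of the result object are assumptions for illustration; sc2rf's internal representation may look different.

```python
import json


def save_results(results, path):
    """Write the analysis result to a JSON file for later viewing or sharing."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(results, handle, indent=2, sort_keys=True)


def load_results(path):
    """Read a previously saved analysis result."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)
```

Note that JSON only knows strings as object keys and has no tuples, so an internal representation using integer keys or tuples would need a small conversion step on the way in and out.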

ENH: Differentiate between clade defining mutations and optional mutations

If I understand your script correctly, you treat all mutations that are above the user-specified threshold identically.

There's room for improvement there.

It would make sense to use two kinds of mutation types for each clade:

Defining mutations that should be present in (almost) all sequences of a clade, so maybe all those mutations present >95%. If these are absent, it means there's a problem either with sequence quality or something else. Absence is very harmful.
Common mutations that sometimes occur, but whose absence does not mean much. Rather, the presence of these mutations increases the probability of a sequence belonging to the clade.

Do you know what I mean? One threshold does not suffice for both concepts.
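In code, the two categories could be separated like this. It is only a sketch of the proposal: the function name, the 0.05 lower cut-off (borrowed from the lower end of the --mutation-threshold range) and the mutation labels in the example are assumptions.

```python
def classify_mutations(prevalences, defining_threshold=0.95, common_threshold=0.05):
    """Split a clade's mutations into defining and common ones by prevalence.

    `prevalences` maps a mutation label to its share (0.0 to 1.0) in the clade.
    """
    defining = {m for m, p in prevalences.items() if p >= defining_threshold}
    common = {m for m, p in prevalences.items()
              if common_threshold <= p < defining_threshold}
    return defining, common


defining, common = classify_mutations({"mut_a": 0.99, "mut_b": 0.40, "mut_c": 0.01})
```

A scoring step could then penalize the absence of a defining mutation strongly and only reward the presence of a common one.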

I'll think a bit more about recombinant detection myself - maybe there are further improvements possible. This is an amazing tool already, though!

Way to pipe results to png, txt files

This is a fantastic tool, and I've already put it to good use in Arkansas to research some strange lineages. Great work!

I do have to share the visuals, and I am wondering if there is a way to pipe the results to an outside file, such as png or txt. I am more of an applied researcher, so if I missed something, I would appreciate any directions.
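For a plain txt file, one possible approach is to redirect the terminal output to a file and remove the ANSI color codes afterwards. The sketch below is not part of sc2rf, and `strip_ansi` is a hypothetical helper; it only covers the text case, not png.

```python
import re

# Matches ANSI CSI sequences such as "\x1b[31m" (set color) or "\x1b[0m" (reset).
ANSI_ESCAPE = re.compile(r"\x1b\[[0-?]*[ -/]*[@-~]")


def strip_ansi(text):
    """Remove ANSI control sequences so the text is readable in any editor."""
    return ANSI_ESCAPE.sub("", text)
```

The colors carry most of the information in the sc2rf output, so the stripped text is mainly useful for sample names and positions.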

Again, great tool already!

Thanks,

Python version requirement 3.9

Thanks for the tool.

Just had a quick note that I think Python 3.9 is required due to the | operator in dict.

I was getting an error before trying it with 3.9.

Fix or remove 21I and 21J

The list of supported clades, which is written down in mappings.csv, was taken from nextclade-data, and contains three Delta clades: 21A, 21I and 21J. The latter two do not map to single pango lineages.

When I switched from nextclade to cov-spectrum to generate the contents of virus_properties.json, I just used mappings.csv to get the pango lineages and make requests to the cov-spectrum API. The pseudo-names "AY (higher)" and "AY (lower)" are not recognized, and the variant definitions in virus_properties.json remain empty. The consequence:

This tool claims to support 21I and 21J, but it currently does not.

To be honest, I never really got around to getting an overview of the Delta/AY diversity, as I got into SARS-CoV-2 genomics when Delta was already declining and Omicron on the rise. Thus, I have no clear plan on how to solve this. I think, once I get my Delta-knowledge up to date, the technical solution might be quite easy.

ENH: repeat gene legend every say 30 samples in case of long list

When I have 100 samples, the legend with gene boundaries goes off screen. It would be nice if the legend could be repeated.