poonlab / sierra-local Goto Github PK

View Code? Open in Web Editor NEW

7.0 7.0 3.0 22.29 MB

Retrieve HIVdb algorithm as XML and apply locally to HIV sequences

License: GNU General Public License v3.0

Python 99.12% R 0.34% CSS 0.01% HTML 0.01% TeX 0.54%

sierra-local's People

Contributors

Stargazers

Watchers

Forkers

ncsutsun4 mych123 1383385

sierra-local's Issues

Parse Stanford algorithm XML

This algorithm assigns integer scores to a sequence for each antiretroviral drug (ARV) for a score-based classifier of resistance.

Installation on macOS

Running sudo python3 setup.py install for macOS (Darwin 16.7) yields the following:

/opt/local/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/distutils/dist.py:261: UserWarning: Unknown distribution option: 'include_package_data'
  warnings.warn(msg)
running install
running build
running build_py
running build_scripts
running install_lib
running install_scripts
changing mode of /opt/local/Library/Frameworks/Python.framework/Versions/3.5/bin/sierralocal to 755
running install_egg_info
Removing /opt/local/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/sierralocal-0.0.1-py3.5.egg-info
Writing /opt/local/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/sierralocal-0.0.1-py3.5.egg-info

Hence the sierralocal binary is installed to the /opt directory where Python 3.5 lives, and not /usr/local/bin. Since /opt/local/.. is not in PATH, the user can't run the binary without specifying a path.

Script should automatically run update

If network connection is available and newer ASI version exists, then we should automatically download it to the local filesystem. If no network connection is available then program should print a warning and then proceed with the existing ASI file. (Consider distributing ASI in repo?)

User issues

>python sierralocal.py seq.fasta -o OUTPUT.json
  File "sierralocal.py", line 56
    print "scoring",query
                  ^
SyntaxError: Missing parentheses in call to 'print'. Did you mean print(print "scoring",query)?

As noted by user, this is a Python2->3 issue.

>python sierralocal.py input.fasta -o OUTPUT.json
FileNotFoundError: [Errno 2] No such file or directory: 'C:\\path..\\..\\sierra-local-master./data/HIVDB.xml'

I think the user needed to pull the XML before running sierralocal.py - provide more informative error message?

Sequence mapped to RNAse

This is the nucamino output for the sequence FJ199571:

['FJ199571.A143M.C.616', '595', '695', '1', '303', 'L607M:ATG,V621I:ATT,T625S:AGT,D626E:GAA,H638Q:CAG,L646S:TCA,S668N:AAT,S674N:AAC,Q679H:CAT,K685R:AGG,A689S:TCA', '']

Variable firstAA takes value 595. In nucaminohook.NucAminoAligner, the dictionary self.gene_map is:

{'PR': (56, 154), 'RT': (155, 594), 'IN': (715, 1003)}

so the sequence apparently falls between RT and IN.

Need to adjust coordinates for sequences spanning multiple regions

For example, for a sequence with PR and RT:

            "consensus": "V",
            "position": 134,
            "AAs": "T",

should be:

            "consensus": "V", 
            "position": 35, 
            "AAs": "T",

Mutation mixtures being scored more than once

Mixture mutations were being counted as present, and thus contributing to resistance, more than once in certain cases. This is due to the current implementation not checking if that residue has been counted in both the partial score for combination-DRMs and single DRMs.

This leads to overestimate of resistance with a small specific set of mutations.

OOP design

Thinking about converting the data structures to Classes:
i.e. list of Query(), one per query sequence, that each have attributes including list of Mutation(), NASeq, scores[ ], length, etc.,
compared to the current (messier but faster) implementation of individuals lists of mutations, scores, sequence headers, etc.

However, using objects results in reduced performance (167 items per second, down from 280 items per second), but is more readable/extensible. It's easier and more clear to work with sequence attributes in this way. Thoughts on the performance/clarity trade-off, @ArtPoon ?

Not scoring deletions correctly

Currently a deletion is identified as K70X, although the sierra pipeline labels the same feature as K70-.

script does not run new XML following forceupdate

Expected behaviour should be to run with the new XML, without having to rerun the script.

sequences with long upstream sequence including deletions are incorrectly subtyped

This results in the starting codon of PR (CCT) being omitted, as NucAmino assigns this AA position 58 (not 57). Consequently a large number of such sequences go unsubtyped because the resulting sequence is out of alignment.

Re-run validation experiments with refactored package

If the results are the same, then package release to PyPI as version 0.0.2 (#35) and switch to submitting the paper.

Regex matching negative scores

Does anyone good with regex know how to capture negative integers as part of a string?
The current expression is: scores = re.findall('[0-9]+(?=\W)', drm.strip()), which captures only the positive numbers in the drug resistance mutation string. However, this excludes the negative scores which is the last barrier I'm encountering to producing validated results.

Example content which I need to extract '-10', among the other numbers from:

Switch to platform-independent paths

Fix hard-coded directory and file paths (e.g. hivdb_file = "./data/HIVDB.xml") using Python 3.4+'s pathlib.Path(). May also need os.path. This should make sure that all the filepaths work on every OS.

Format of parsed GLOBALRANGE

The format of GLOBALRANGE in the XML file is:

<GLOBALRANGE>
(-INF TO 9 => 1, 10 TO 14 => 2, 15 TO 29 => 3, 30 TO 59 => 4, 60 TO INF => 5)
</GLOBALRANGE>

I have parsed this expression into a dictionary where the orders are keys and the ranges are values. Should the ranges and orders be represented numerically or as strings (as below)?

{'1': '-INF TO 9', '2': '10 TO 14', '3': '15 TO 29', '4': '30 TO 59', '5': '60 TO INF'}

Refactoring broke genotyping of INT sequences

Number of discordant calls went up from 0 to 147. PR and RT results are fine.
I have temporarily removed subtyping and sequence trimming (see #40) but this does not seem to be responsible for the discrepancy. Need to zero in on one of the discordant sequences and see why we're getting different results.

Insertions relative to reference need to be recorded for scoring

Post-processing pairwise alignment, we need some data structure for recording the location and composition of sequence insertions. For example:

AGTCAT---TTTGGA  # reference
AGTCATCCCTTTGGA  # query

where AGT is codon location 0 (zero-indexing), we can record as:

{'rpos': 1, 'seq': 'CCC'}

testing pip3 install on fresh Ubuntu installation

Ubuntu workstation running 4.15.0-32-generic

needed to run sudo apt install python3-pip in order to get pip3. This also installed several other packages including gcc and python3.6:

The following NEW packages will be installed:
build-essential dh-python dpkg-dev fakeroot g++ g++-7 gcc gcc-7
libalgorithm-diff-perl libalgorithm-diff-xs-perl libalgorithm-merge-perl
libasan4 libatomic1 libc-dev-bin libc6-dev libcilkrts5 libexpat1-dev
libfakeroot libgcc-7-dev libitm1 liblsan0 libmpx2 libpython3-dev
libpython3.6-dev libquadmath0 libstdc++-7-dev libtsan0 libubsan0
linux-libc-dev make manpages-dev python-pip-whl python3-dev
python3-distutils python3-lib2to3 python3-pip python3-setuptools
python3-wheel python3.6-dev

running pip3 install sierralocal ran successfully:

Collecting sierralocal
  Downloading https://files.pythonhosted.org/packages/ce/a8/2501b1b3ad9b8abc7994b7ece3d325b3d765401c4ce7f5760bda765f5267/sierralocal-0.0.1.tar.gz (1.7MB)
  100% |████████████████████████████████| 1.7MB 1.2MB/s 
Building wheels for collected packages: sierralocal
  Running setup.py bdist_wheel for sierralocal ... done
  Stored in directory: /home/art/.cache/pip/wheels/e1/2f/28/fab709a2a25b279a11774fbae50520a3dfeb370ad940e84c68
Successfully built sierralocal
Installing collected packages: sierralocal
Successfully installed sierralocal-0.0.1

but sierralocal is not in PATH:

art@Lio:~$ sierralocal
sierralocal: command not found

this is because the package was downloaded to ~/.local, which is not in the default PATH after Ubuntu 16.04. Calling the script directly works fine. We can also run the following:
```
export PATH=~/.local/bin:$PATH
```

I downloaded the scripts/retrieve_hivdb_data.py script and ran it:

python3 retrieve_hivdb_data.py RT RT.fa

to have some test data. Here are the results:

art@Lio:~$ sierralocal RT.fa
Error: could not retrieve HIVDB XML. Updating...
Unable to update HIVDB XML. Try running this script from the root directory.
Error: could not retrieve APOBEC DRM data. Updating...
Unable to update APOBEC DRMs. Try running this script from the root directory.
Traceback (most recent call last):
  File "/home/art/.local/bin/sierralocal", line 11, in <module>
    sys.exit(main())
  File "/home/art/.local/lib/python3.6/site-packages/sierralocal/__init__.py", line 56, in main
    algorithm = HIVdb()
  File "/home/art/.local/lib/python3.6/site-packages/sierralocal/hivdb.py", line 47, in __init__
    self.root = xml.parse(self.xml_filename).getroot()
  File "/usr/lib/python3.6/xml/etree/ElementTree.py", line 1196, in parse
    tree.parse(source, parser)
  File "/usr/lib/python3.6/xml/etree/ElementTree.py", line 586, in parse
    source = open(source, "rb")
TypeError: expected str, bytes or os.PathLike object, not NoneType

Allow user to specify local NucAmino binary and XML, data files

To make it easier to run sierralocal without sudo privileges.
See #42

Scoring algorithm results not matching with HIVdb output

The current implementation with HXB2 pol test sequence
jasper@asus:~/git/sierra-local$ python sierra-local.py -i hxb2pol.fasta
results in the following:
{'ETR': 10, 'didanosine': 10, 'abacavir': 5, 'DTG': 5, 'EVG': 10, 'saquinavir/r': 75, 'nevirapine': 60, 'rilpivirine': 10, 'DRV/r': 30, 'SQV/r': 75, 'lopinavir/r': 90, 'AZT': 20, 'IDV/r': 90, 'atazanavir/r': 90, 'indinavir/r': 90, '3TC': 0, 'fosamprenavir/r': 75, 'FPV/r': 75, 'lamivudine': 0, 'TPV/r': 105, 'etravirine': 10, 'D4T': 20, 'ABC': 5, 'raltegravir': 60, 'tenofovir': 5, 'emtricitabine': 0, 'stavudine': 20, 'EFV': 60, 'NFV': 90, 'DDI': 10, 'nelfinavir': 90, 'dolutegravir': 5, 'tipranavir/r': 105, 'elvitegravir': 10, 'LPV/r': 90, 'NVP': 60, 'FTC': 0, 'RPV': 10, 'TDF': 5, 'zidovudine': 20, 'ATV/r': 90, 'efavirenz': 60, 'darunavir/r': 30, 'RAL': 60}

Compared to the output from HIVdb, these scores are higher than the predicted resistances:

sierrapy gives the correct output.

Double checked this with just score_alg.py, just in case this was an issue with my implementation. Will take a closer look this weekend.

Convert to Python module

Ideal use case

import sierra
json = sierra.score(<path to FASTA>)

Loss of speed in latest commit

Somehow I've managed to make things a lot slower :-P
Before (commit c51f610..f4684e5):

art@orolo:~/git/sierra-local$ sierralocal test.fa
sierralocal/data/HIVDB_8.6.1.ca26d72f.xml
Found NucAmino binary nucamino-linux-amd64
AF013440.143P8W0.B.3 0 1
AF021278.JIDP2W0.B.14 2 0
Y16151.RO-BCI9.F.24 0 4
AY802655.CA9914.B.51 0 1
AF169102.Vent31B.B.60 0 6
AF153404.MilazzoTO1.B.69 0 3
100 sequences found in file test.fa.
sierralocal/data/HIVDB_8.6.1.ca26d72f.xml
Found HIVdb file sierralocal/data/HIVDB_8.6.1.ca26d72f.xml
Writing JSON to file test-local.json
Time elapsed: 1.8673 seconds (53.553 it/s)

After (last commit cb737b4ef2a99bcbaaafdefecb06c5d15b09dddb):

art@orolo:~/git/sierra-local$ sierralocal test.fa
HIVdb version 8.6.1
Found NucAmino binary nucamino-linux-amd64
test.fa
100 sequences found in file test.fa.
Writing JSON to file test_results.json
Time elapsed: 6.5858 seconds (15.356 it/s)

Incorporate Stanford HIVDB's NucAmino aligner?

https://github.com/hivdb/NucAmino

It's written in the Go programming language, with the option of using a Docker container..

subtyper doesn't work, requires aligned sequences

PR outputs are reasonable but RT and INT are not aligned to the references.

Provide JSON viewer (HTML)

Make GUI with Tkinter?

Time permitting. I've done this before in other Python scripts:
https://github.com/cfe-lab/mixed-hcv/blob/master/mixed-hcv.py

import Tkinter as tk
import ttk
import tkFileDialog

class App:
    def __init__(self, master):
        self.__version__ = '0.1'
        self.master = master
        frame = tk.Frame(master)  # simple container to hold other widgets
        frame.pack()  # resize to contents and make visible

Hard-to-reach binary for macOS installation from source via MacPorts

Running sudo python setup.py install on macOS 10.13.3 (High Sierra) with Python 3.5.4 (via MacPorts) installs the sierralocal binary to /opt/local/Library/Frameworks/Python.framework/Versions/3.5/bin/sierralocal.
Looks like pip will do the same thing. See pypa/pip#113

Validation against sierra-client

Download a diverse set of HIV pol sequence data
Run sequences with both sierra-client and sierra local and compare JSON
Evaluate differences if any

Sequence Subtyping

See https://hivdb.stanford.edu/page/hiv-subtyper/ for the process:

If the distance to the closest matching reference sequence is greater than 11%, the subtype is reported as unknown.

If the distance to the closest matching reference sequence is below the distance upper-limit for that subtype and the closest matching reference is a pure subtype, CRF01_AE, CRF02_AG, or other non-simple CRF (as defined in the subtype properties table), then this subtype is reported to the user.

If the distance to the closest matching reference sequence is below the distance upper-limit for that subtype and the closest matching reference is a simple CRF, then the program determines whether the submitted sequence contains sufficient coverage on either side of the established breakpoint(s) for that CRF. If there is sufficient coverage, then the program reports this CRF. Otherwise, the program reports the parent subtype for the region covered by the submitted sequence.

If the distance to the closest matching reference sequence is above the distance upper-limit for that subtype and the closest matching reference is a CRF01_AE, CRF02_AG, or a simple CRF, we examine the original table for the next-closest matching parent sequence. If the distance to the parent sequence is within 1% of the closest matching sequence, then the parent sequence is reported. If the distance to the parent sequence is more than 1% greater than the distance to the recombinant, then we consider the subtype to be a URF and each of the parents of the recombinant are reported separated by a plus sign. Subtype A is considered to be the parent of CRF01_AE and subtypes A and G are considered to be the parents of CRF02_AG.

If the distance to the closest matching reference sequence is above the distance upper-limit for that subtype and the closest matching reference is a pure subtype, then this pure subtype is reported to the user with a warning indicating that further analysis using a more sophisticated subtyping program is indicated.

Subtyping slows down pipeline substantially

The benchmarks in the first version of the paper used subtype classifications from the sequence data (HIVdb predictions) - this skips the local subtyping step so naturally the algorithm runs much faster. For a fair comparison we need to measure computing time with local subtyping.

Query frameshift gap (~) issues

Sierra interprets this as a mutation to X (unknown).
NucAmino, the base of our implementation, outputs the gap as a mixture (i.e. not X) which is handled by the rest of the algorithm as the same as any other mixture.

In limited cases, this may lead to a DRM being 'present' at the deletion locus and contributing to the penalty score where the original algorithm would not.

XML woes

Hi @ArtPoon

I'm trying to run sierra-local, but I get errors if I don't include an XML:

Error: could not retrieve HIVDB XML
Unable to update HIVDB XML. Try manually downloading the HIVdb ASI2.
Traceback (most recent call last):
  File "/usr/local/bin/sierralocal", line 11, in <module>
    exit_code = main.main()
  File "/usr/local/lib/python3.5/dist-packages/sierralocal/main.py", line 157, in main
    count, time_elapsed = sierralocal(args.fasta, args.outfile, args.xml, args.skipalign, args.forceupdate)
  File "/usr/local/lib/python3.5/dist-packages/sierralocal/main.py", line 111, in sierralocal
    algorithm = HIVdb(path=xml, forceupdate=forceupdate)
  File "/usr/local/lib/python3.5/dist-packages/sierralocal/hivdb.py", line 58, in __init__
    if not os.path.isfile(Path(os.path.dirname(__file__))/'data'/'apobec.tsv'):
  File "/usr/lib/python3.5/genericpath.py", line 30, in isfile
    st = os.stat(path)
TypeError: argument should be string, bytes or integer, not PosixPath

while including the HIVdb XML gives the following error

Traceback (most recent call last):
  File "/usr/local/bin/sierralocal", line 11, in <module>
    exit_code = main.main()
  File "/usr/local/lib/python3.5/dist-packages/sierralocal/main.py", line 157, in main
    count, time_elapsed = sierralocal(args.fasta, args.outfile, args.xml, args.skipalign, args.forceupdate)
  File "/usr/local/lib/python3.5/dist-packages/sierralocal/main.py", line 111, in sierralocal
    algorithm = HIVdb(path=xml, forceupdate=forceupdate)
  File "/usr/local/lib/python3.5/dist-packages/sierralocal/hivdb.py", line 58, in __init__
    if not os.path.isfile(Path(os.path.dirname(__file__))/'data'/'apobec.tsv'):
  File "/usr/lib/python3.5/genericpath.py", line 30, in isfile
    st = os.stat(path)
TypeError: argument should be string, bytes or integer, not PosixPath

Installation from PyPI (pip) failed on fresh Ubuntu workstation

Looks like a user permissions problem:

searching path /usr/local/lib/python3.6/dist-packages/sierralocal/data/HIVDB*.xml
Error: could not find local copy of HIVDB XML, attempting download...
Unable to update HIVDB XML. Try manually downloading the HIVdb ASI2.
[Errno 13] Permission denied: '/usr/local/lib/python3.6/dist-packages/sierralocal/data/HIVDB_8.7.9e470b87.xml'
/usr/local/lib/python3.6/dist-packages/sierralocal/data/apobec.tsv
Error: could not retrieve APOBEC DRM data
Unable to update APOBEC DRMs. Try manually downloading the APOBEC DRM TSV into data/apobec.tsv
Traceback (most recent call last):
  File "/usr/local/bin/sierralocal", line 11, in <module>
    exit_code = main.main()
  File "/usr/local/lib/python3.6/dist-packages/sierralocal/main.py", line 174, in main
    cleanup=args.cleanup, forceupdate=args.forceupdate)
  File "/usr/local/lib/python3.6/dist-packages/sierralocal/main.py", line 107, in sierralocal
    algorithm = HIVdb(path=xml, forceupdate=forceupdate)
  File "/usr/local/lib/python3.6/dist-packages/sierralocal/hivdb.py", line 76, in __init__
    self.root = xml.parse(str(self.xml_filename)).getroot()
  File "/usr/lib/python3.6/xml/etree/ElementTree.py", line 1196, in parse
    tree.parse(source, parser)
  File "/usr/lib/python3.6/xml/etree/ElementTree.py", line 586, in parse
    source = open(source, "rb")
FileNotFoundError: [Errno 2] No such file or directory: 'None'

Installation history:

sudo apt install python3-pip
sudo -H python3 -m pip install --index-url https://test.pypi.org/simple/ sierralocal

Sequence offset contributing to errors in mutation scoring

If the sequence spans the entire reading frame, we get the correct mutations:

Sequence Name	POL FirstAA	POL LastAA	POL FirstNA	POL LastNA	POL Mutations
DJ264.France	716	1003	1	864	K729R:AGA,V746I:ATA,S754C:TGC,L816I:ATA,T827V:GTA,T839A:GCT,T840A:GCA,G849N:AAT,I850V:GTC,K851T:ACA,V916I:ATA,T921S:TCA,T933I:ATA,L949I:ATT,N969K:AAG,R984K:AAA,S998G:GGT

This works fine and leads to correct output.

However, if the sequence is truncated, the mutations we get are off by a constant shift. Trying to figure out the math here to correct this dynamically.

Sequence Name	POL FirstAA	POL LastAA	POL FirstNA	POL LastNA	POL Mutations
T29INT.Norway	795	1003	1	627	Y798F:TTC,G821A:GCA,T827V:GTA,S834P:CCT,T840A:GCC,V841M:ATG,A848T:ACA,K851Q:CAA,Q852H:CAT,S868A:GCC,K875S:TCC,G878Q:CAG,I897V:GTT,R902RG*:NGA,S910T:ACT,I915L:TTA,V916I:ATA,I919L:TTA,T921S:TCA,D922Q:CAA,K926T:ACA,T933L:TTA,N937K:AAA,L949I:ATT,K955Q:CAG,N969K:AAA,S970G:GGG,K973E:GAG,K988Q:CAA,D993T:ACT,D1001T:ACA,D1003S:AGT

What we should get from this:
Y83F, G106A, T112V, S119P, T125A, V126M, A133T, K136Q, Q137H, S153A, K160S, G163Q, I182V, R187R*G, S195T, I200L, V201I, I204L, T206S, D207Q, K211T, T218L, N222K, L234I, K240Q, N254K, S255G, K258E, K273Q, D278T, D286T

What we get instead:
D128Q K132T D209S T139L N175K N143K Y4F G27A T33V L155I S40P D199T T46A V47M S176G K179E A54T K57Q Q58H K194Q K161Q S74A D207T K81S G84Q I103V R108R S116T I121L V122I I125L T127S

They're off by 79. This leads to the wrong mutations being scored/omitted in the algorithm

Validation status

Using the testing script, I ran ~1000 consecutive sequences from HIVdb of each of PR, IN, RT on both sierra-local and sierrapy, and compared the outputs:

Gene	Absolute Match Ratio
PR	0.983
IN	1.000
RT	0.950

Remaining issues are related to polymorphisms and idiosyncrasies in how sierrapy gets its information versus what I get. When an error exists, they are systematically higher than expected by several specific intervals, leading me to suspect that it's the accidental inclusion of a set of specific DRMs in cases where sierrapy would not

Clean up repo

massive TSV file needs to be removed
other data files should be pulled from web via script if possible (for local storage and updating)

Move subtyper.py to another branch

Trim low quality nucleotides in the start and end of each sequence

See sierra/NucAminoAligner.java for the method:

	/**
	 *  Input sequence may contain non-POL NAs in the beginning and the end, e.g. AF442565, KF134931.
	 *
	 * Following code did two things:
	 *
	 *   1. Remove large (length > SEQUENCE_TRIM_SITES_CUTOFF) low quality pieces from gene sequence
	 *   2. Keep small (length <= SEQUENCE_TRIM_SITES_CUTOFF) low quality pieces from gene sequence
	 *
	 * A site is considered "low quality" if it:
	 *   - is unusual mutation;
	 *   - has "X" in aas; or
	 *   - has stop codon
	 */
	private static int[] trimLowQualities(
			Sequence sequence, int firstAA, int lastAA,
			Collection<Mutation> mutations, Collection<FrameShift> frameShifts) {
		int badPcnt;
		int trimLeft = 0;
		int trimRight = 0;
		int problemSites = 0;
		int sinceLastBadQuality = 0;
		int proteinSize = lastAA - firstAA + 1;
		List<Integer> candidates = new ArrayList<>();
		List<Boolean> invalidSites = new ArrayList<>(Collections.nCopies(proteinSize, false));
		for (Mutation mut : mutations) {
			int idx = mut.getPosition() - firstAA;
			if (!mut.isUnsequenced() && (
					mut.getHighestMutPrevalence() < SEQUENCE_SHRINKAGE_BAD_QUALITY_MUT_PREVALENCE
					|| mut.getAAs().equals("X") || mut.isApobecMutation() || mut.hasStop())) {
				invalidSites.set(idx, true);
			}
		}
		for (FrameShift fs : frameShifts) {
			int idx = fs.getPosition() - firstAA;
			invalidSites.set(idx,  true);
		}
		// forward scan for trimming left
		for (int idx=0; idx < proteinSize; idx ++) {
			if (sinceLastBadQuality > SEQUENCE_SHRINKAGE_WINDOW) {
				break;
			} else if (invalidSites.get(idx)) {
				problemSites ++;
				trimLeft = idx + 1;
				badPcnt = trimLeft > 0 ? problemSites * 100 / trimLeft : 0;
				if (badPcnt > SEQUENCE_SHRINKAGE_CUTOFF_PCNT) {
					candidates.add(trimLeft);
				}
				sinceLastBadQuality = 0;
			} else {
				sinceLastBadQuality ++;
			}
		}
		trimLeft = candidates.size() > 0 ? candidates.get(candidates.size() - 1) : 0;
		candidates.clear();
		// backward scan for trimming right
		problemSites = 0;
		sinceLastBadQuality = 0;
		for (int idx=proteinSize-1; idx > -1; idx --) {
			if (sinceLastBadQuality > SEQUENCE_SHRINKAGE_WINDOW) {
				break;
			} else if (invalidSites.get(idx)) {
				problemSites ++;
				trimRight = proteinSize - idx;
				badPcnt = trimRight > 0 ? problemSites * 100 / trimRight : 0;
				if (badPcnt > SEQUENCE_SHRINKAGE_CUTOFF_PCNT) {
					candidates.add(trimRight);
				}
				sinceLastBadQuality = 0;
			} else {
				sinceLastBadQuality ++;
			}
		}
		trimRight = candidates.size() > 0 ? candidates.get(candidates.size() - 1) : 0;
		return new int[]{trimLeft, trimRight};
}

sierralocal fails on full-length pol sequence

Testing on LANL subtype reference pol sequence set:

HIVdb version 8.6.1
Found NucAmino binary /usr/local/lib/python3.6/dist-packages/sierralocal/bin/nucamino-linux-amd64
Aligned pol.fa
Traceback (most recent call last):
  File "/usr/local/bin/sierralocal", line 11, in <module>
    exit_code = main.main()
  File "/usr/local/lib/python3.6/dist-packages/sierralocal/main.py", line 160, in main
    count, time_elapsed = sierralocal(args.fasta, args.outfile, args.xml, args.skipalign, args.forceupdate)
  File "/usr/local/lib/python3.6/dist-packages/sierralocal/main.py", line 132, in sierralocal
    file_trims, subtypes = scorefile(input_file, database, skipalign)
  File "/usr/local/lib/python3.6/dist-packages/sierralocal/main.py", line 75, in scorefile
    aligner.get_mutations(input_file, do_subtyping=True)
  File "/usr/local/lib/python3.6/dist-packages/sierralocal/nucaminohook.py", line 178, in get_mutations
    assert gene is not None, "Fatal error in get_mutations"
AssertionError: Fatal error in get_mutations

Update installation instructions

esp. after testing on Windows and macOS

Apply algorithm to an aligned amino acid sequence

This would require the following steps:

Align nucleotide sequence to reference (see #3)
Translate aligned sequence into amino acids
Look up amino acid positions and identities (e.g., leucine at reference position 30 in RT) in parsed XML object
Tally the total resistance prediction score.

Polymorphism interpretation

I've tracked down most of the bugs and errors. Currently sierra-local can process 97-100% of a large sample size verified against sierrapy. However, the main issue seems to be with polymorphic variations.

Example:
Nucamino returns this mutation list for 71-1.Cameroon :

[...], Q95KNRSQH:MRK, [...]

I interpret this currently as Q at residue 95 -> any one of KNRSQH. However, sierrapy interprets this as Q95X. This causes me to wrongfully add a score if any of Q95KNRSQH are associated with a drug resistance. In other cases, sierrapy interprets polymorphisms in the way I currently do, and there are no issues there.

Is there a standard way to determine when an amino acid is considered "unknown--X" vs just a list of polymorphisms?

There's also an issue with sierrapy choosing some polymorphisms over others.
Let's say a sequence has the mutation V82TA:RCC. Sierrapy sometimes chooses to exclude the score for V82T alone, and other times uses either V82A or V82T in its scoring. I'm not exactly sure how they determine this. Will keep digging.

Plan return object (data structure) from scoring script

This should be a dictionary. We can base the structure of this on the XML output of the HIVdb algorithm:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<DrugResistance_Interpretation xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="http://hivdb.stanford.edu/DR/schema/sierra.xsd">
    <algorithmName>HIVDB</algorithmName>
    <algorithmVersion>8.4</algorithmVersion>
    <webServiceVersion>2.0</webServiceVersion>
    <schemaVersion>1.1</schemaVersion>
    <submissionName/>
    <dateTime>2017-08-22 18:19:47.978</dateTime>
    <result>
        <success>true</success>
        <inputSequence>
            <md5sum>f8a9b4e6cfb7de1b4be1cb0eabd1a1cd</md5sum>
            <name>userinput unamed sample: 1</name>
            <sequence>CCTCAGGTCACTCTTTGGCAACGACCCCTCGTCACAATAAAGATAGGGGGGCAACTAAAGGAAGCTCTATTAGATACAGGAGCAGATGATACAGTATTAGAAGAAATGAGTTTGCCAGGAAGATGGAAACCAAAAATGATAGGGGGAATTGGAGGTTTTATCAAAGTAAGACAGTATGATCAGATACTCATAGAAATCTGTGGACATAAAGCTATAGGTACAGTATTAGTAGGACCTACACCTGTCAACATAATTGGAAGAAATCTGTTGACTCAGATTGGTTGCACTTTAAATTTTCCCATTAGCCCTATTGAGACTGTACCAGTAAAATTAAAGCCAGGAATGGATGGCCCAAAAGTTAAACAATGGCCATTGACAGAAGAAAAAATAAAAGCATTAGTAGAAATTTGTACAGAGATGGAAAAGGAAGGGAAAATTTCAAAAATTGGGCCTGAAAATCCATACAATACTCCAGTATTTGCCATAAAGAAAAAAGACAGTACTAAATGGAGAAAATTAGTAGATTTCAGAGAACTTAATAAGAGAACTCAAGACTTCTGGGAAGTTCAATTAGGAATACCACATCCCGCAGGGTTAAAAAAGAAAAAATCAGTAACAGTACTGGATGTGGGTGATGCATATTTTTCAGTTCCCTTAGATGAAGACTTCAGGAAGTATACTGCATTTACCATACCTAGTATAAACAATGAGACACCAGGGATTAGATATCAGTACAATGTGCTTCCACAGGGATGGAAAGGATCACCAGCAATATTCCAAAGTAGCATGACAAAAATCTTAGAGCCTTTTAGAAAACAAAATCCAGACATAGTTATCTATCAATACATGGATGATTTGTATGTAGGATCTGACTTAGAAATAGGGCAGCATAGAACAAAAATAGAGGAGCTGAGACAACATCTGTTGAGGTGGGGACTTACCACACCAGACAAAAAACATCAGAAAGAACCTCCATTCCTTTGGATGGGTTATGAACTCCATCCTGATAAATGGACAGTACAGCCTATAGTGCTGCCAGAAAAAGACAGCTGGACTGTCAATGACATACAGAAGTTAGTGGGGAAATTGAATTGGGCAAGTCAGATTTACCCAGGGATTAAAGTAAGGCAATTATGTAAACTCCTTAGAGGAACCAAAGCACTAACAGAAGTAATACCACTAACAGAAGAAGCAGAG</sequence>
        </inputSequence>
        <GAHypermutated>false</GAHypermutated>
        <geneData>
            <gene>PR</gene>
            <present>true</present>
            <consensus>PQITLWQRPLVTIKIGGQLKEALLDTGADDTVLEEMNLPGRWKPKMIGGIGGFIKVRQYDQILIEICGHKAIGTVLVGPTPVNIIGRNLLTQIGCTLNF</consensus>
            <alignedNASequence>CCTCAGGTCACTCTTTGGCAACGACCCCTCGTCACAATAAAGATAGGGGGGCAACTAAAGGAAGCTCTATTAGATACAGGAGCAGATGATACAGTATTAGAAGAAATGAGTTTGCCAGGAAGATGGAAACCAAAAATGATAGGGGGAATTGGAGGTTTTATCAAAGTAAGACAGTATGATCAGATACTCATAGAAATCTGTGGACATAAAGCTATAGGTACAGTATTAGTAGGACCTACACCTGTCAACATAATTGGAAGAAATCTGTTGACTCAGATTGGTTGCACTTTAAATTTT</alignedNASequence>
            <alignedAASequence>PQVTLWQRPLVTIKIGGQLKEALLDTGADDTVLEEMSLPGRWKPKMIGGIGGFIKVRQYDQILIEICGHKAIGTVLVGPTPVNIIGRNLLTQIGCTLNF</alignedAASequence>
            <firstAA>1</firstAA>
            <lastAA>99</lastAA>
            <subtype>
                <type>B (0.00%)</type>
                <percentSimilarity>100</percentSimilarity>
            </subtype>
            <mutation>
                <classification>OTHER</classification>
                <type>mutation</type>
                <mutationString>I3V</mutationString>
                <wildType>I</wildType>
                <position>3</position>
                <nucleicAcid>GTC</nucleicAcid>
                <translatedNA>V</translatedNA>
            </mutation>
            <mutation>
                <classification>OTHER</classification>
                <type>mutation</type>
                <mutationString>N37S</mutationString>
                <wildType>N</wildType>
                <position>37</position>
                <nucleicAcid>AGT</nucleicAcid>
                <translatedNA>S</translatedNA>
            </mutation>
            <quality/>
        </geneData>
        <geneData>
            <gene>RT</gene>
            <present>true</present>
            <consensus>PISPIETVPVKLKPGMDGPKVKQWPLTEEKIKALVEICTEMEKEGKISKIGPENPYNTPVFAIKKKDSTKWRKLVDFRELNKRTQDFWEVQLGIPHPAGLKKKKSVTVLDVGDAYFSVPLDKDFRKYTAFTIPSINNETPGIRYQYNVLPQGWKGSPAIFQSSMTKILEPFRKQNPDIVIYQYMDDLYVGSDLEIGQHRTKIEELRQHLLRWGFTTPDKKHQKEPPFLWMGYELHPDKWTVQPIVLPEKDSWTVNDIQKLVGKLNWASQIYAGIKVKQLCKLLRGTKALTEVIPLTEEAELELAENREILKEPVHGVYYDPSKDLIAEIQKQGQGQWTYQIYQEPFKNLKTGKYARMRGAHTNDVKQLTEAVQKIATESIVIWGKTPKFKLPIQKETWEAWWTEYWQATWIPEWEFVNTPPLVKLWYQLEKEPIVGAETFYVDGAANRETKLGKAGYVTDRGRQKVVSLTDTTNQKTELQAIHLALQDSGLEVNIVTDSQYALGIIQAQPDKSESELVSQIIEQLIKKEKVYLAWVPAHKGIGGNEQVDKLVSAGIRKVL</consensus>
            <alignedNASequence>CCCATTAGCCCTATTGAGACTGTACCAGTAAAATTAAAGCCAGGAATGGATGGCCCAAAAGTTAAACAATGGCCATTGACAGAAGAAAAAATAAAAGCATTAGTAGAAATTTGTACAGAGATGGAAAAGGAAGGGAAAATTTCAAAAATTGGGCCTGAAAATCCATACAATACTCCAGTATTTGCCATAAAGAAAAAAGACAGTACTAAATGGAGAAAATTAGTAGATTTCAGAGAACTTAATAAGAGAACTCAAGACTTCTGGGAAGTTCAATTAGGAATACCACATCCCGCAGGGTTAAAAAAGAAAAAATCAGTAACAGTACTGGATGTGGGTGATGCATATTTTTCAGTTCCCTTAGATGAAGACTTCAGGAAGTATACTGCATTTACCATACCTAGTATAAACAATGAGACACCAGGGATTAGATATCAGTACAATGTGCTTCCACAGGGATGGAAAGGATCACCAGCAATATTCCAAAGTAGCATGACAAAAATCTTAGAGCCTTTTAGAAAACAAAATCCAGACATAGTTATCTATCAATACATGGATGATTTGTATGTAGGATCTGACTTAGAAATAGGGCAGCATAGAACAAAAATAGAGGAGCTGAGACAACATCTGTTGAGGTGGGGACTTACCACACCAGACAAAAAACATCAGAAAGAACCTCCATTCCTTTGGATGGGTTATGAACTCCATCCTGATAAATGGACAGTACAGCCTATAGTGCTGCCAGAAAAAGACAGCTGGACTGTCAATGACATACAGAAGTTAGTGGGGAAATTGAATTGGGCAAGTCAGATTTACCCAGGGATTAAAGTAAGGCAATTATGTAAACTCCTTAGAGGAACCAAAGCACTAACAGAAGTAATACCACTAACAGAAGAAGCAGAG</alignedNASequence>
            <alignedAASequence>PISPIETVPVKLKPGMDGPKVKQWPLTEEKIKALVEICTEMEKEGKISKIGPENPYNTPVFAIKKKDSTKWRKLVDFRELNKRTQDFWEVQLGIPHPAGLKKKKSVTVLDVGDAYFSVPLDEDFRKYTAFTIPSINNETPGIRYQYNVLPQGWKGSPAIFQSSMTKILEPFRKQNPDIVIYQYMDDLYVGSDLEIGQHRTKIEELRQHLLRWGLTTPDKKHQKEPPFLWMGYELHPDKWTVQPIVLPEKDSWTVNDIQKLVGKLNWASQIYPGIKVRQLCKLLRGTKALTEVIPLTEEAE</alignedAASequence>
            <firstAA>1</firstAA>
            <lastAA>300</lastAA>
            <subtype>
                <type>B (0.00%)</type>
                <percentSimilarity>100</percentSimilarity>
            </subtype>
            <mutation>
                <classification>OTHER</classification>
                <type>mutation</type>
                <mutationString>K122E</mutationString>
                <wildType>K</wildType>
                <position>122</position>
                <nucleicAcid>GAA</nucleicAcid>
                <translatedNA>E</translatedNA>
            </mutation>
            <mutation>
                <classification>OTHER</classification>
                <type>mutation</type>
                <mutationString>F214L</mutationString>
                <wildType>F</wildType>
                <position>214</position>
                <nucleicAcid>CTT</nucleicAcid>
                <translatedNA>L</translatedNA>
            </mutation>
            <mutation>
                <classification>OTHER</classification>
                <type>mutation</type>
                <mutationString>A272P</mutationString>
                <wildType>A</wildType>
                <position>272</position>
                <nucleicAcid>CCA</nucleicAcid>
                <translatedNA>P</translatedNA>
            </mutation>
            <mutation>
                <classification>OTHER</classification>
                <type>mutation</type>
                <mutationString>K277R</mutationString>
                <wildType>K</wildType>
                <position>277</position>
                <nucleicAcid>AGG</nucleicAcid>
                <translatedNA>R</translatedNA>
            </mutation>
            <quality/>
        </geneData>
        <geneData>
            <gene>IN</gene>
            <present>false</present>
        </geneData>
        <sequenceQualityCounts>
            <insertions>0</insertions>
            <deletions>0</deletions>
            <ambiguous>0</ambiguous>
            <stops>0</stops>
            <frameshifts>0</frameshifts>
        </sequenceQualityCounts>
        <drugScore>
            <drugCode>ATV/r</drugCode>
            <genericName>atazanavir/r</genericName>
            <type>PI</type>
            <score>0</score>
            <resistanceLevel>1</resistanceLevel>
            <resistanceLevelText>Susceptible</resistanceLevelText>
            <threeStepResistanceLevel>S</threeStepResistanceLevel>
        </drugScore>
        <drugScore>
            <drugCode>DRV/r</drugCode>
            <genericName>darunavir/r</genericName>
            <type>PI</type>
            <score>0</score>
            <resistanceLevel>1</resistanceLevel>
            <resistanceLevelText>Susceptible</resistanceLevelText>
            <threeStepResistanceLevel>S</threeStepResistanceLevel>
        </drugScore>
        <drugScore>
            <drugCode>FPV/r</drugCode>
            <genericName>fosamprenavir/r</genericName>
            <type>PI</type>
            <score>0</score>
            <resistanceLevel>1</resistanceLevel>
            <resistanceLevelText>Susceptible</resistanceLevelText>
            <threeStepResistanceLevel>S</threeStepResistanceLevel>
        </drugScore>
        <drugScore>
            <drugCode>IDV/r</drugCode>
            <genericName>indinavir/r</genericName>
            <type>PI</type>
            <score>0</score>
            <resistanceLevel>1</resistanceLevel>
            <resistanceLevelText>Susceptible</resistanceLevelText>
            <threeStepResistanceLevel>S</threeStepResistanceLevel>
        </drugScore>
        <drugScore>
            <drugCode>LPV/r</drugCode>
            <genericName>lopinavir/r</genericName>
            <type>PI</type>
            <score>0</score>
            <resistanceLevel>1</resistanceLevel>
            <resistanceLevelText>Susceptible</resistanceLevelText>
            <threeStepResistanceLevel>S</threeStepResistanceLevel>
        </drugScore>
        <drugScore>
            <drugCode>NFV</drugCode>
            <genericName>nelfinavir</genericName>
            <type>PI</type>
            <score>0</score>
            <resistanceLevel>1</resistanceLevel>
            <resistanceLevelText>Susceptible</resistanceLevelText>
            <threeStepResistanceLevel>S</threeStepResistanceLevel>
        </drugScore>
        <drugScore>
            <drugCode>SQV/r</drugCode>
            <genericName>saquinavir/r</genericName>
            <type>PI</type>
            <score>0</score>
            <resistanceLevel>1</resistanceLevel>
            <resistanceLevelText>Susceptible</resistanceLevelText>
            <threeStepResistanceLevel>S</threeStepResistanceLevel>
        </drugScore>
        <drugScore>
            <drugCode>TPV/r</drugCode>
            <genericName>tipranavir/r</genericName>
            <type>PI</type>
            <score>0</score>
            <resistanceLevel>1</resistanceLevel>
            <resistanceLevelText>Susceptible</resistanceLevelText>
            <threeStepResistanceLevel>S</threeStepResistanceLevel>
        </drugScore>
        <drugScore>
            <drugCode>ABC</drugCode>
            <genericName>abacavir</genericName>
            <type>NRTI</type>
            <score>0</score>
            <resistanceLevel>1</resistanceLevel>
            <resistanceLevelText>Susceptible</resistanceLevelText>
            <threeStepResistanceLevel>S</threeStepResistanceLevel>
        </drugScore>
        <drugScore>
            <drugCode>AZT</drugCode>
            <genericName>zidovudine</genericName>
            <type>NRTI</type>
            <score>0</score>
            <resistanceLevel>1</resistanceLevel>
            <resistanceLevelText>Susceptible</resistanceLevelText>
            <threeStepResistanceLevel>S</threeStepResistanceLevel>
        </drugScore>
        <drugScore>
            <drugCode>D4T</drugCode>
            <genericName>stavudine</genericName>
            <type>NRTI</type>
            <score>0</score>
            <resistanceLevel>1</resistanceLevel>
            <resistanceLevelText>Susceptible</resistanceLevelText>
            <threeStepResistanceLevel>S</threeStepResistanceLevel>
        </drugScore>
        <drugScore>
            <drugCode>DDI</drugCode>
            <genericName>didanosine</genericName>
            <type>NRTI</type>
            <score>0</score>
            <resistanceLevel>1</resistanceLevel>
            <resistanceLevelText>Susceptible</resistanceLevelText>
            <threeStepResistanceLevel>S</threeStepResistanceLevel>
        </drugScore>
        <drugScore>
            <drugCode>FTC</drugCode>
            <genericName>emtricitabine</genericName>
            <type>NRTI</type>
            <score>0</score>
            <resistanceLevel>1</resistanceLevel>
            <resistanceLevelText>Susceptible</resistanceLevelText>
            <threeStepResistanceLevel>S</threeStepResistanceLevel>
        </drugScore>
        <drugScore>
            <drugCode>3TC</drugCode>
            <genericName>lamivudine</genericName>
            <type>NRTI</type>
            <score>0</score>
            <resistanceLevel>1</resistanceLevel>
            <resistanceLevelText>Susceptible</resistanceLevelText>
            <threeStepResistanceLevel>S</threeStepResistanceLevel>
        </drugScore>
        <drugScore>
            <drugCode>TDF</drugCode>
            <genericName>tenofovir</genericName>
            <type>NRTI</type>
            <score>0</score>
            <resistanceLevel>1</resistanceLevel>
            <resistanceLevelText>Susceptible</resistanceLevelText>
            <threeStepResistanceLevel>S</threeStepResistanceLevel>
        </drugScore>
        <drugScore>
            <drugCode>EFV</drugCode>
            <genericName>efavirenz</genericName>
            <type>NNRTI</type>
            <score>0</score>
            <resistanceLevel>1</resistanceLevel>
            <resistanceLevelText>Susceptible</resistanceLevelText>
            <threeStepResistanceLevel>S</threeStepResistanceLevel>
        </drugScore>
        <drugScore>
            <drugCode>ETR</drugCode>
            <genericName>etravirine</genericName>
            <type>NNRTI</type>
            <score>0</score>
            <resistanceLevel>1</resistanceLevel>
            <resistanceLevelText>Susceptible</resistanceLevelText>
            <threeStepResistanceLevel>S</threeStepResistanceLevel>
        </drugScore>
        <drugScore>
            <drugCode>NVP</drugCode>
            <genericName>nevirapine</genericName>
            <type>NNRTI</type>
            <score>0</score>
            <resistanceLevel>1</resistanceLevel>
            <resistanceLevelText>Susceptible</resistanceLevelText>
            <threeStepResistanceLevel>S</threeStepResistanceLevel>
        </drugScore>
        <drugScore>
            <drugCode>RPV</drugCode>
            <genericName>rilpivirine</genericName>
            <type>NNRTI</type>
            <score>0</score>
            <resistanceLevel>1</resistanceLevel>
            <resistanceLevelText>Susceptible</resistanceLevelText>
            <threeStepResistanceLevel>S</threeStepResistanceLevel>
        </drugScore>
        <scoreTable>
            <scoreRow>
                <score value="PI"/>
                <score value="ATV/r"/>
                <score value="DRV/r"/>
                <score value="FPV/r"/>
                <score value="IDV/r"/>
                <score value="LPV/r"/>
                <score value="NFV"/>
                <score value="SQV/r"/>
                <score value="TPV/r"/>
            </scoreRow>
            <scoreRow>
                <score value="Total:"/>
                <score class="PI" drug="ATV/r" value="0"/>
                <score class="PI" drug="DRV/r" value="0"/>
                <score class="PI" drug="FPV/r" value="0"/>
                <score class="PI" drug="IDV/r" value="0"/>
                <score class="PI" drug="LPV/r" value="0"/>
                <score class="PI" drug="NFV" value="0"/>
                <score class="PI" drug="SQV/r" value="0"/>
                <score class="PI" drug="TPV/r" value="0"/>
            </scoreRow>
        </scoreTable>
        <scoreTable>
            <scoreRow>
                <score value="NRTI"/>
                <score value="ABC"/>
                <score value="AZT"/>
                <score value="D4T"/>
                <score value="DDI"/>
                <score value="FTC"/>
                <score value="3TC"/>
                <score value="TDF"/>
            </scoreRow>
            <scoreRow>
                <score value="Total:"/>
                <score class="NRTI" drug="ABC" value="0"/>
                <score class="NRTI" drug="AZT" value="0"/>
                <score class="NRTI" drug="D4T" value="0"/>
                <score class="NRTI" drug="DDI" value="0"/>
                <score class="NRTI" drug="FTC" value="0"/>
                <score class="NRTI" drug="3TC" value="0"/>
                <score class="NRTI" drug="TDF" value="0"/>
            </scoreRow>
        </scoreTable>
        <scoreTable>
            <scoreRow>
                <score value="NNRTI"/>
                <score value="EFV"/>
                <score value="ETR"/>
                <score value="NVP"/>
                <score value="RPV"/>
            </scoreRow>
            <scoreRow>
                <score value="Total:"/>
                <score class="NNRTI" drug="EFV" value="0"/>
                <score class="NNRTI" drug="ETR" value="0"/>
                <score class="NNRTI" drug="NVP" value="0"/>
                <score class="NNRTI" drug="RPV" value="0"/>
            </scoreRow>
        </scoreTable>
    </result>
</DrugResistance_Interpretation>

parse_definitions may not be robust

The root of the definitions is the 'DEFINITIONS' element in root.getchildren() (index 3).

>>> root.getchildren()
[<Element 'ALGNAME' at 0x10355e3b8>, <Element 'ALGVERSION' at 0x10355e318>, <Element 'ALGDATE' at 0x10355e368>, <Element 'DEFINITIONS' at 0x10355e408>, <Element 'DRUG' at 0x1037800e8>, <Element 'DRUG' at 0x103780368>, <Element 'DRUG' at 0x1037805e8>, <Element 'DRUG' at 0x103780868>, <Element 'DRUG' at 0x103780ae8>, <Element 'DRUG' at 0x103780d68>, <Element 'DRUG' at 0x103784f48>, <Element 'DRUG' at 0x103784d68>, <Element 'DRUG' at 0x103784ae8>, <Element 'DRUG' at 0x103784868>, <Element 'DRUG' at 0x1037845e8>, <Element 'DRUG' at 0x103784368>, <Element 'DRUG' at 0x1037840e8>, <Element 'DRUG' at 0x103786e08>, <Element 'DRUG' at 0x103786b88>, <Element 'DRUG' at 0x103786908>, <Element 'DRUG' at 0x103786688>, <Element 'DRUG' at 0x103786408>, <Element 'DRUG' at 0x103786188>, <Element 'DRUG' at 0x10377dea8>, <Element 'DRUG' at 0x10377dc28>, <Element 'DRUG' at 0x10377d9a8>, <Element 'MUTATION_COMMENTS' at 0x10377d728>]
>>> root.getchildren()[3]
<Element 'DEFINITIONS' at 0x10355e408>

Currently, I iterate over the definition elements with:

for element in root.getchildren()[3]:

It would be better to get the index of the 'DEFINITIONS' element from the list rather than hardcoding [3]. The issue I'm having is that I am not able to use a regular expression to look for the item in the list that contains DEFINITION, since the items are not str type.

>>> type(root.getchildren()[3])
<class 'xml.etree.ElementTree.Element'>

Generate aligned amino acid sequence from nucleotide sequence input

Use ArtPoon/gotoh2 for pairwise alignment to reference

do Procrustean pairwise alignment (align to reference and discard all positions that are insertions relative to the reference)

GUI script broken

sierralocal.py doesn't exist any more
replacing subprocess calls with direct calls to module

Pull comments files dynamically

There are some resources that we have obtained once from the Stanford site, e.g., whether mutations are major or minor resistance variants -- ideally sierra-local should check the existing files against online resources and update dynamically when necessary.

Classification of mutations (to-do)

From HIVdb FAQ:

The HIV Drug Resistance Database uses several different classification schemes. The HIVdb program and several of the query pages (such as detailed treatment queries, detailed mutation queries, advanced queries, detailed phenotype queries, and reference page) use a 3-way classification scheme that classifies protease mutations into "Major", "Minor", and "Other", RT mutations into "NRTI", "NNRTI", and "Other" and integrase mutations into "Major", "Minor" and "Other". According to this scheme, "NRTI" and "NNRTI" mutations cannot be further subdivided without the output becoming too unwieldy requiring 5 rather than 3 categories. Moreover, the original sub-classification of mutations into "Major" and "Minor" categories began with the PIs and so most HIV drug resistance researchers and clinicians are comfortable with this classification scheme.

Make sure to process the mutations when writing the JSON into these classifications, which depends on their location. There isn't a readily-available dataset that allows for this. I either have to scrape the HIVdb website (example) or find a way to do it using the current XML-based algo.

Subtyping script fails on one item

@jzpero wrote a small test method under subtyper.main() that calculates the closest reference sequence to each sequence in the same set -- in other words, a sequence's closest match should be itself. The following sequence fails this criterion for both v0.0.1 and the current refactoring:

AH011123|A|HXB22253-5093|year:1997|country:CM|author:None CRF22_01A1 A

It isn't clear whether the original HIVdb subtyper would obtain the same result.

Score maximization bug

There's a minor bug where overlapping DRM combinations aren't being selected to give the highest penalty score.
Example:

Combination	DRMs	Score
1	M41L + T215F	15
2	M41L + T215Y	5

Here, combo 1 should be chosen if the mutation at 215 is a mixture of FY, since the penalty score is the highest. However, the algorithm encounters combo 2 first, then excludes combo 1 since a mutation at 41 and 215 has already been recorded.

solution being explored: record ALL mutations that have scores and are present, and then if overlaps are present, select the one with the maximal absolute score, leaving the other ones behind