arvkevi / ezancestry Goto Github PK

Easy genetic ancestry predictions in Python

Home Page: https://ezancestry.streamlit.app

License: MIT License

Python 27.33% Shell 0.07% Jupyter Notebook 72.58% Procfile 0.02%

genomics genomics-visualization personal-genomics streamlit data-visualization dimensionality-reduction ancestry genotypes

ezancestry's Introduction

Hello, 👋 my name is Kevin Arvai. I'm a data scientist with 10+ years of experience in the genomics field.

Connect 🤝
Since you're here, let me know you stopped by. Share your Python or data science story with me on Twitter or LinkedIn. I love hearing about what people are working on in the open-source community!

Favorite project 🧬
I wrote an app that predicts users' ancestry from their genetic data.

Non-GitHub stuff 💻
I like machine learning, open-source software/data, and genomics.
My Real Python articles, blog posts, and Kaggle profile.

ezancestry's People

Contributors

Stargazers

Watchers

Forkers

jefferyansah apriha lisaiceland peppyguy datascienceecuador mcuevasg nadia-el knmkr dowing ancestrydna

ezancestry's Issues

use static 1kg samples file

Stop using ftp site to populate dfsamples.

Dependency issues installing with pip

There seems to be a dependency issue that occurs during processing after installing ezancestry with pip, related to cyvcf2. See here. After installing ezancestry and downgrading to cyvcf2==0.30.14, the issue is resolved.

Move population toggle

Thanks to Amanda Kelly for the suggestion.

Move the population toggle from the sidebar to adjacent the figure.

check app logs

Get unit tests working on Actions

Write additional unit tests to test the basic functionality of using ezancestry as a library.

Test failures with Python 3.10

snps cron test jobs with ezancestry recently started failing on Github Actions Python 3.10 setups:

Last working build (06/18/2022): https://github.com/apriha/snps/actions/runs/2522809062
First broken build (06/25/2022): https://github.com/apriha/snps/actions/runs/2563003443

Here is an example of the error:

 =================================== FAILURES ===================================
 ____________________________ TestSnps.test_ancestry ____________________________
 __init__.pxd:942: in numpy.import_array
     ???
 E   RuntimeError: module compiled against API version 0x10 but this version of numpy is 0xf
 
 During handling of the above exception, another exception occurred:
 tests/test_snps.py:486: in test_ancestry
     self._make_ancestry_assertions(s.predict_ancestry())
 /opt/hostedtoolcache/Python/3.10.5/x64/lib/python3.10/site-packages/snps/snps.py:1683: in predict_ancestry
     from ezancestry.commands import predict
 /opt/hostedtoolcache/Python/3.10.5/x64/lib/python3.10/site-packages/ezancestry/commands.py:10: in <module>
     from sklearn.model_selection import train_test_split
 /opt/hostedtoolcache/Python/3.10.5/x64/lib/python3.10/site-packages/sklearn/__init__.py:82: in <module>
     from .base import clone
 /opt/hostedtoolcache/Python/3.10.5/x64/lib/python3.10/site-packages/sklearn/base.py:17: in <module>
     from .utils import _IS_32BIT
 /opt/hostedtoolcache/Python/3.10.5/x64/lib/python3.10/site-packages/sklearn/utils/__init__.py:22: in <module>
     from .murmurhash import murmurhash3_32
 sklearn/utils/murmurhash.pyx:27: in init sklearn.utils.murmurhash
     ???
 __init__.pxd:944: in numpy.import_array
     ???
 E   ImportError: numpy.core.multiarray failed to import

I tried fixing this build by installing different combinations of dependencies (new versions of numpy and pandas were released on 06/22/2022 and 06/23/2022, respectively), but no luck yet.

Number of snps covered

Inform the user how many AIsnps they have in their uploaded sample.
It will either be n/55 (Kidd) or n/128 (Seldin)

Using latest 1k Genome dataset for building model

Hi,
I'm exploring adding an ethnic background to sample QC pipeline I'm working on and this tool seems to check all the boxes
I was involved in making re-analysis of the latest addition of samples for the 1k Genome project available using Illumina DRAGEN available on AWS.

DRAGEN reanalysis of the 1000 Genomes Dataset now available on the Registry of Open Data

Although I can't find similar complete aggregate bed file as the ones pulled by your fetch script, do you think your pipeline can easily be modified to create a model using this updated data set?
Would love to hear your thoughts on cost/benefit of this approach.

Thanks,
Daniel Brami

Unable to load umap model

Hi, I'm unable to load the included umap models. PCA models work.

When running the following code,

# write all the super population dimred models for kidd and Seldin
for aisnps_set, df, df_labels in zip(
    ["kidd", "Seldin"], 
    [df_kidd_encoded, df_seldin_encoded], 
    [df_kidd["superpopulation"], df_seldin["superpopulation"]]
):
    for algorithm, labels in zip(["pca", "umap", "nca"], [None, None, None, df_labels]):
        print(algorithm,aisnps_set,OVERWRITE_MODEL,labels)
        df_reduced = dimensionality_reduction(df, algorithm=algorithm, aisnps_set=aisnps_set, overwrite_model=OVERWRITE_MODEL, labels=labels, population_level="super population")
        knn_model = train(df_reduced, df_labels, algorithm=algorithm, aisnps_set=aisnps_set, k=9, population_level="superpopulation", overwrite_model=OVERWRITE_MODEL)

I get the error below:

2022-08-22 17:16:03.823 | INFO     | ezancestry.dimred:dimensionality_reduction:126 - Successfully loaded a dimensionality reduction model
pca kidd False None
umap kidd False None
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Input In [17], in <cell line: 2>()
      7 for algorithm, labels in zip(["pca", "umap", "nca"], [None, None, None, df_labels]):
      8     print(algorithm,aisnps_set,OVERWRITE_MODEL,labels)
----> 9     df_reduced = dimensionality_reduction(df, algorithm=algorithm, aisnps_set=aisnps_set, overwrite_model=OVERWRITE_MODEL, labels=labels, population_level="super population")
     10     knn_model = train(df_reduced, df_labels, algorithm=algorithm, aisnps_set=aisnps_set, k=9, population_level="superpopulation", overwrite_model=OVERWRITE_MODEL)

File ~/ezancestry/ezancestry/dimred.py:107, in dimensionality_reduction(df, algorithm, aisnps_set, n_components, overwrite_model, labels, population_level, models_directory, random_state)
    105 if algorithm in set(["pca", "umap"]):
    106     try:
--> 107         reducer = joblib.load(
    108             models_directory.joinpath(f"{algorithm}.{aisnps_set}.bin")
    109         )
    110     except FileNotFoundError:
    111         return None

File ~/opt/anaconda3/lib/python3.9/site-packages/joblib/numpy_pickle.py:587, in load(filename, mmap_mode)
    581             if isinstance(fobj, str):
    582                 # if the returned file object is a string, this means we
    583                 # try to load a pickle file generated with an version of
    584                 # Joblib so we load it with joblib compatibility function.
    585                 return load_compatibility(fobj)
--> 587             obj = _unpickle(fobj, filename, mmap_mode)
    588 return obj

File ~/opt/anaconda3/lib/python3.9/site-packages/joblib/numpy_pickle.py:506, in _unpickle(fobj, filename, mmap_mode)
    504 obj = None
    505 try:
--> 506     obj = unpickler.load()
    507     if unpickler.compat_mode:
    508         warnings.warn("The file '%s' has been generated with a "
    509                       "joblib version less than 0.10. "
    510                       "Please regenerate this pickle file."
    511                       % filename,
    512                       DeprecationWarning, stacklevel=3)

File ~/opt/anaconda3/lib/python3.9/pickle.py:1212, in _Unpickler.load(self)
   1210             raise EOFError
   1211         assert isinstance(key, bytes_types)
-> 1212         dispatch[key[0]](self)
   1213 except _Stop as stopinst:
   1214     return stopinst.value

File ~/opt/anaconda3/lib/python3.9/pickle.py:1589, in _Unpickler.load_reduce(self)
   1587 args = stack.pop()
   1588 func = stack[-1]
-> 1589 stack[-1] = func(*args)

File ~/opt/anaconda3/lib/python3.9/site-packages/numba/core/serialize.py:97, in _unpickle__CustomPickled(serialized)
     92 def _unpickle__CustomPickled(serialized):
     93     """standard unpickling for `_CustomPickled`.
     94 
     95     Uses `NumbaPickler` to load.
     96     """
---> 97     ctor, states = loads(serialized)
     98     return _CustomPickled(ctor, states)

AttributeError: Can't get attribute '_rebuild_function' on <module 'numba.core.serialize' from '/Users/jacksonc08/opt/anaconda3/lib/python3.9/site-packages/numba/core/serialize.py'>

I have tested that it is certainly the UMAP model that is causing the issue.

import pandas as pd

import joblib

obj = joblib.load(r"/Users/jacksonc08/ezancestry/data/models/umap.kidd.bin")

This gives the same error.

Looking online, it seems to be an issue with the numba package (a dependency of joblib), which no longer includes the _rebuild_function function. See here.

Do you have any recommendations on how to fix this error? Many thanks.

Add the option to display ancestry labels.

Thanks to Amanda Kelly for the suggestion.

How we can use our training set?

Hello @arvkevi,
Thank you for providing this software.
I wonder how we can use another training set (instead of the 1000 genome) in your software?
Regards,
Siavash

Script erroring on single VCF input

Hi Kevin,
Here's the output from running script on a .VCF file:
(dnafinger) (base) ➜ VCFs ls HG01082.hard-filtered.vcf (dnafinger) (base) ➜ VCFs ezancestry predict ./ no SNPs loaded... 2022-06-15 08:44:45.149 | DEBUG | ezancestry.process:_input_to_dataframe:260 - No snps found in the input_data 2022-06-15 08:44:45.159 | DEBUG | ezancestry.process:process_user_input:194 - 'utf-8' codec can't decode byte 0x80 in position 3131: invalid start byte 2022-06-15 08:44:45.159 | WARNING | ezancestry.process:process_user_input:195 - Skipping .DS_Store because it was not valid 2022-06-15 08:44:45.170 | INFO | ezancestry.process:encode_genotypes:137 - Successfully loaded an encoder from /Users/bramid/.ezancestry/data/models/one_hot_encoder.kidd.bin Traceback (most recent call last): File "/Users/bramid/dnafinger/bin/ezancestry", line 8, in <module> sys.exit(app()) File "/Users/bramid/dnafinger/lib/python3.9/site-packages/typer/main.py", line 214, in __call__ return get_command(self)(*args, **kwargs) File "/Users/bramid/dnafinger/lib/python3.9/site-packages/click/core.py", line 829, in __call__ return self.main(*args, **kwargs) File "/Users/bramid/dnafinger/lib/python3.9/site-packages/click/core.py", line 782, in main rv = self.invoke(ctx) File "/Users/bramid/dnafinger/lib/python3.9/site-packages/click/core.py", line 1259, in invoke return _process_result(sub_ctx.command.invoke(sub_ctx)) File "/Users/bramid/dnafinger/lib/python3.9/site-packages/click/core.py", line 1066, in invoke return ctx.invoke(self.callback, **ctx.params) File "/Users/bramid/dnafinger/lib/python3.9/site-packages/click/core.py", line 610, in invoke return callback(*args, **kwargs) File "/Users/bramid/dnafinger/lib/python3.9/site-packages/typer/main.py", line 500, in wrapper return callback(**use_params) # type: ignore File "/Users/bramid/dnafinger/lib/python3.9/site-packages/ezancestry/commands.py", line 288, in predict snpsdf = encode_genotypes( File "/Users/bramid/dnafinger/lib/python3.9/site-packages/ezancestry/process.py", line 140, in encode_genotypes X = ohe.transform(df.values) File "/Users/bramid/dnafinger/lib/python3.9/site-packages/sklearn/preprocessing/_encoders.py", line 471, in transform X_int, X_mask = self._transform(X, handle_unknown=self.handle_unknown, File "/Users/bramid/dnafinger/lib/python3.9/site-packages/sklearn/preprocessing/_encoders.py", line 113, in _transform X_list, n_samples, n_features = self._check_X( File "/Users/bramid/dnafinger/lib/python3.9/site-packages/sklearn/preprocessing/_encoders.py", line 44, in _check_X X_temp = check_array(X, dtype=None, File "/Users/bramid/dnafinger/lib/python3.9/site-packages/sklearn/utils/validation.py", line 63, in inner_f return f(*args, **kwargs) File "/Users/bramid/dnafinger/lib/python3.9/site-packages/sklearn/utils/validation.py", line 726, in check_array raise ValueError("Found array with %d sample(s) (shape=%s) while a" ValueError: Found array with 0 sample(s) (shape=(0, 55)) while a minimum of 1 is required.

I'm attaching the first 1k lines of this VCF in case it's a problem with VCF, not the code.
The complete VCF can be downloaded as so:
aws s3 cp --no-sign-request s3://1000genomes-dragen-3.7.6/data/individuals/hg19-graph-based/HG01082/HG01082.hard-filtered.vcf.gz .

1k_HG01082.hard-filtered.vcf.gz

Bump Python version

Heroku-22 stack does not support 3.8.6 specified in runtime.txt.

fix streamlit app

File "/home/appuser/venv/lib/python3.7/site-packages/streamlit/script_runner.py", line 379, in _run_script
    exec(code, module.__dict__)
File "/app/ezancestry/streamlit/app.py", line 12, in <module>
    from util import (
File "streamlit/util.py", line 4, in <module>
    import vcf
File "/home/appuser/venv/lib/python3.7/site-packages/vcf/__init__.py", line 166, in <module>
    from parser import Reader, Writer

Add to awesome-streamlit

Support for compressed VCF files

Hi Kevin,
I started to try playing with the script but I'm running into a few speed bumps that I thought I would share in different tickets):

does not support .vcf.gz files
Not sure if BCF is more useful but for now, my data is in .vcf.gz

Is supporting compressed VCF hard to implement?

ValueError: vcf is not a valid file or directory. Please provide a valid file or directory.

Hi Kevin , I'm trying this script but I'm running into this error during the prediction:
(the vcf file was annotated with VEP)

DEBUG | ezancestry.process:process_user_input:214 - list index out of range
Traceback (most recent call last):
File "/home/tigem/r.desantis/.local/lib/python3.9/site-packages/ezancestry/process.py", line 217, in process_user_input
snpsdf = pd.read_csv(
File "/usr/local/lib/python3.9/dist-packages/pandas/util/_decorators.py", line 311, in wrapper
return func(*args, **kwargs)
File "/usr/local/lib/python3.9/dist-packages/pandas/io/parsers/readers.py", line 678, in read_csv
return _read(filepath_or_buffer, kwds)
File "/usr/local/lib/python3.9/dist-packages/pandas/io/parsers/readers.py", line 581, in _read
return parser.read(nrows)
File "/usr/local/lib/python3.9/dist-packages/pandas/io/parsers/readers.py", line 1253, in read
index, columns, col_dict = self._engine.read(nrows)
File "/usr/local/lib/python3.9/dist-packages/pandas/io/parsers/python_parser.py", line 270, in read
alldata = self._rows_to_cols(content)
File "/usr/local/lib/python3.9/dist-packages/pandas/io/parsers/python_parser.py", line 1013, in _rows_to_cols
self._alert_malformed(msg, row_num + 1)
File "/usr/local/lib/python3.9/dist-packages/pandas/io/parsers/python_parser.py", line 739, in _alert_malformed
raise ParserError(msg)
pandas.errors.ParserError: Expected 3 fields in line 7, saw 4

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/home/tigem/r.desantis/.local/bin/ezancestry", line 8, in
sys.exit(app())
File "/home/tigem/r.desantis/.local/lib/python3.9/site-packages/typer/main.py", line 214, in call
return get_command(self)(*args, **kwargs)
File "/home/tigem/r.desantis/.local/lib/python3.9/site-packages/click/core.py", line 829, in call
return self.main(*args, **kwargs)
File "/home/tigem/r.desantis/.local/lib/python3.9/site-packages/click/core.py", line 782, in main
rv = self.invoke(ctx)
File "/home/tigem/r.desantis/.local/lib/python3.9/site-packages/click/core.py", line 1259, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/home/tigem/r.desantis/.local/lib/python3.9/site-packages/click/core.py", line 1066, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/home/tigem/r.desantis/.local/lib/python3.9/site-packages/click/core.py", line 610, in invoke
return callback(*args, **kwargs)
File "/home/tigem/r.desantis/.local/lib/python3.9/site-packages/typer/main.py", line 532, in wrapper
return callback(**use_params) # type: ignore
File "/home/tigem/r.desantis/.local/lib/python3.9/site-packages/ezancestry/commands.py", line 286, in predict
snpsdf = process_user_input(input_data, aisnps_directory, aisnps_set)
File "/home/tigem/r.desantis/.local/lib/python3.9/site-packages/ezancestry/process.py", line 232, in process_user_input
raise ValueError(
ValueError: a1.VEP.ann.vcf is not a valid file or directory. Please provide a valid file or directory.

Include some examples

Thanks to Amanda Kelly for the suggestion.
Give users the option to explore at least one sample as if it were their own.