nbisweden / igdiscover-legacy

Analyze antibody repertoires and discover new V genes from high-throughput sequencing reads

Home Page: https://www.igdiscover.se

License: MIT License

Python 99.89% TeX 0.01% Shell 0.03% Singularity 0.08%
python bioinformatics-pipeline

Contributors

ganeshphad, jhagberg, marcelm, nestorvb

igdiscover-legacy's Issues

Update IgBLAST

We are using 1.4.0, but 1.6.0 has been released recently.

Remove intermediate files by default

It is not necessary to keep large files from intermediate iterations. Candidates for deletion are assigned.tab.gz, igblast.txt.gz and filtered.tab.gz.
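A minimal sketch of such a cleanup step, assuming each iteration lives in its own directory and that the caller passes those directories in order. The function name `remove_intermediate_files` and its `keep_last` and `dry_run` parameters are hypothetical and not part of IgDiscover; only the three file names come from the text above.

```python
from pathlib import Path

# File names are taken from the issue text; the directory layout is an assumption.
INTERMEDIATE_FILES = ("assigned.tab.gz", "igblast.txt.gz", "filtered.tab.gz")


def remove_intermediate_files(iteration_dirs, keep_last=True, dry_run=False):
    """Delete large intermediate files from the given iteration directories.

    Returns the list of paths that were (or, with dry_run, would be) deleted.
    """
    dirs = [Path(d) for d in iteration_dirs]
    if keep_last and dirs:
        dirs = dirs[:-1]  # leave the most recent iteration untouched
    removed = []
    for directory in dirs:
        for name in INTERMEDIATE_FILES:
            path = directory / name
            if path.is_file():
                if not dry_run:
                    path.unlink()
                removed.append(path)
    return removed
```

Calling it with `dry_run=True` first lists what would be removed without touching any files.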

Check whether input database is ok

Originally reported by: Marcel Martin (Bitbucket: marcelm, GitHub: marcelm)


When running igdiscover init, we should make sure that the FASTA files

  • exist
  • are parseable
  • do not contain duplicate sequences
  • do not contain duplicate sequence ids
  • are named V.fasta, J.fasta and (optionally) D.fasta

If no D.fasta exists, a dummy file should be created.

If we do not check this, makeblastdb will fail at a much later stage.
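The existence, naming and parseability checks could be sketched roughly as follows. This is not the actual igdiscover init code: `read_fasta` and `check_database` are hypothetical helpers, and the content written to the dummy D.fasta is a placeholder chosen for illustration only. Duplicate detection is left out of this sketch.

```python
from pathlib import Path


def read_fasta(path):
    """Parse a FASTA file into a list of (id, sequence) tuples.

    Raises ValueError on an empty header or on sequence data
    that appears before the first header.
    """
    records = []
    name, chunks = None, []
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                if name is not None:
                    records.append((name, "".join(chunks)))
                fields = line[1:].split()
                if not fields:
                    raise ValueError(f"{path}: empty FASTA header")
                name, chunks = fields[0], []
            elif name is None:
                raise ValueError(f"{path}: sequence data before first '>' header")
            else:
                chunks.append(line.upper())
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def check_database(db_dir):
    """Check that V.fasta and J.fasta exist and parse; create a dummy D.fasta."""
    db_dir = Path(db_dir)
    for gene in ("V", "J"):
        path = db_dir / f"{gene}.fasta"
        if not path.is_file():
            raise FileNotFoundError(f"required database file {path} is missing")
    d_path = db_dir / "D.fasta"
    if not d_path.exists():
        # Placeholder record; the real dummy content is an assumption.
        d_path.write_text(">dummy_D\nGGGGGGGGGG\n")
    return {gene: read_fasta(db_dir / f"{gene}.fasta") for gene in ("V", "D", "J")}
```

Running these checks up front turns a late makeblastdb failure into an immediate, specific error message.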


Improve CDR3 detection by using a more recent IgBLAST

The CDR3 detection regular expression by D'Addario et al. that we use looks roughly like this: [FY] [FHVWY] C ... W [AGV] (the ... is the CDR3; the residues to its left and right are the allowed flanking amino acids). However, it fails to detect some CDR3s. For example, in one of the datasets, the C is mutated to a W.
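To make the failure mode concrete, here is a small sketch that turns the rough pattern above into a Python regular expression. It is a simplification based only on the description in this issue, not the exact expression used in the code; `CDR3_PATTERN`, `find_cdr3` and the two example peptides are made up for illustration.

```python
import re

# Simplified pattern following the description above:
# two flanking residues, the conserved C, the CDR3 itself, then W and one of A/G/V.
CDR3_PATTERN = re.compile(r"[FY][FHVWY]C(?P<cdr3>.+?)W[AGV]")


def find_cdr3(protein):
    """Return the CDR3 amino acids, or None if the flanking motif is not found."""
    match = CDR3_PATTERN.search(protein)
    return match.group("cdr3") if match else None


# Hypothetical peptide with an intact conserved cysteine.
print(find_cdr3("AVYYCARGDYWGQGT"))  # ARGDY
# The same peptide with the C mutated to W: the motif no longer matches.
print(find_cdr3("AVYYWARGDYWGQGT"))  # None
```

Because the pattern requires a literal C, a single mutation at that position is enough to lose the CDR3 entirely.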

Assign clonotype to a small number of sequences

Given a list of clonotypes and a FASTA file with previously unseen sequences, assign each of the sequences in the FASTA file to its matching clonotype.

For this, we need a way to run IgBLAST separately on a single sequence.

Depends on #41

Be more helpful when validating the input database

igdiscover init validates the input database and checks whether there are duplicate sequences.

The error message reports only the first duplicate sequence that is found, so init has to be re-run after every fix until all problems have been found. This is annoying, especially if the GUI is used.
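One way to report everything at once is to collect all problems before raising an error. The sketch below assumes the database has already been parsed into (id, sequence) tuples; `find_duplicates` and its message wording are hypothetical, not the existing validation code.

```python
from collections import defaultdict


def find_duplicates(records):
    """Return one message per duplicated id and per duplicated sequence.

    records is an iterable of (id, sequence) tuples.
    """
    id_counts = defaultdict(int)
    ids_by_sequence = defaultdict(list)
    for name, sequence in records:
        id_counts[name] += 1
        ids_by_sequence[sequence.upper()].append(name)

    problems = []
    for name, count in id_counts.items():
        if count > 1:
            problems.append(f"sequence id {name!r} occurs {count} times")
    for names in ids_by_sequence.values():
        if len(names) > 1:
            problems.append("identical sequence in records: " + ", ".join(names))
    return problems
```

The caller can then print the whole list and fail once, so that a single run of init is enough to see every duplicate.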

For cross-mapping correction, ignore last 5 bases

When computing the distance between two V sequences, ignore differences in the last bases, that is, those that are part of the CDR3.

  • Among multiple V sequences, the one with the highest number of CDR3s is usually the full-length one - that one should be kept.
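As a rough illustration of the distance computation, the sketch below trims a fixed number of bases from the 3' end of both sequences before computing a plain edit distance. It assumes both sequences end at the same position; `edit_distance`, `v_distance` and the `ignore_last` parameter are hypothetical names, and the default of 5 is taken from the issue title.

```python
def edit_distance(s, t):
    """Levenshtein distance via the standard dynamic-programming recurrence."""
    previous = list(range(len(t) + 1))
    for i, a in enumerate(s, start=1):
        current = [i]
        for j, b in enumerate(t, start=1):
            current.append(min(
                previous[j] + 1,             # deletion
                current[j - 1] + 1,          # insertion
                previous[j - 1] + (a != b),  # match or substitution
            ))
        previous = current
    return previous[-1]


def v_distance(s, t, ignore_last=5):
    """Edit distance between two V sequences, skipping the last bases."""
    if ignore_last > 0:
        s, t = s[:-ignore_last], t[:-ignore_last]
    return edit_distance(s, t)
```

With this, two V sequences that differ only within their last five bases get a distance of zero and can be treated as the same gene during cross-mapping correction.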

Count and report clonotypes

For now, two sequences belong to the same clonotype if

  • they have the same V and J assignment (ignore D)
  • their CDR3s have the same length
  • their CDR3s are identical up to some adjustable threshold (70-100%)

Input is the filtered.tab.gz file. Output should be a table in which sequences coming from the same clonotype are listed next to each other.
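The three criteria above can be sketched as follows. This is only one possible reading: sequences are first bucketed by V gene, J gene and CDR3 length, and within a bucket each sequence is greedily compared against the first member of every existing group. The greedy strategy, the function names and the tuple layout of the input rows are assumptions; reading filtered.tab.gz is not shown.

```python
from collections import defaultdict


def cdr3_identity(a, b):
    """Fraction of identical positions between two CDR3s of equal length."""
    if len(a) != len(b):
        raise ValueError("CDR3s must have the same length")
    if not a:
        return 1.0
    return sum(x == y for x, y in zip(a, b)) / len(a)


def group_clonotypes(rows, min_identity=0.9):
    """Group rows of (name, v_gene, j_gene, cdr3) into clonotypes.

    Returns a list of groups; members of a group are adjacent in the output.
    """
    buckets = defaultdict(list)
    for row in rows:
        name, v_gene, j_gene, cdr3 = row
        buckets[(v_gene, j_gene, len(cdr3))].append(row)

    clonotypes = []
    for bucket in buckets.values():
        groups = []  # the first row of each group acts as its representative
        for row in bucket:
            for group in groups:
                if cdr3_identity(row[3], group[0][3]) >= min_identity:
                    group.append(row)
                    break
            else:
                groups.append([row])
        clonotypes.extend(groups)
    return clonotypes
```

Writing the groups out one after another yields the requested table in which sequences from the same clonotype are listed next to each other. Note that comparing only against a representative is order-dependent; a full single-linkage clustering would be an alternative design.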

Better subsampling

Originally reported by: Marcel Martin (Bitbucket: marcelm, GitHub: marcelm)


If many sequences are assigned to a single gene, we may miss low-expressed variants when we cluster only 1000 randomly picked sequences. Instead, we can subsample from those sequences that have some minimum edit distance to the database gene, which makes it more likely that the interesting sequences are included.
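A minimal sketch of this idea, assuming the edit distance of each sequence to its assigned database gene is already known. The function name `subsample` and the parameters `min_distance` and `seed` are hypothetical; the default size of 1000 comes from the text above.

```python
import random


def subsample(sequences, distances, size=1000, min_distance=1, seed=None):
    """Pick up to `size` sequences among those that differ from the database gene.

    distances[i] is the edit distance of sequences[i] to its assigned gene.
    """
    rng = random.Random(seed)
    distant = [s for s, d in zip(sequences, distances) if d >= min_distance]
    if len(distant) > size:
        return rng.sample(distant, size)
    return distant
```

Sequences that are identical to the database gene never enter the sample, so the clustering budget is spent on reads that could represent a new variant.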

