Giter Site home page Giter Site logo

jappy0 / deduplication_errorcorrection Goto Github PK

View Code? Open in Web Editor NEW
1.0 1.0 0.0 86.97 MB

How error correction affects PCR deduplication: a survey based on UMI datasets of short reads

Python 68.58% Shell 0.72% Makefile 0.51% C 6.80% C++ 21.61% Perl 1.78%

deduplication_errorcorrection's Introduction

How error correction affects PCR deduplication: a survey based on UMI datasets of short reads

This repository contains the scripts, the state-of-the-art methods and datasets (links) used to conduct the analysis for the study "How error correction affects PCR deduplication: a survey based on UMI datasets of short reads".

Notes: This repository contains other researchers' source codes and executable software of PCR-deduplication and error-correction on short reads. The copyright of these source codes and programs belongs to the original authors. If you use all or part of them, please remember to cite these works list in Reference.

Table of Contents

Reproducing results

All the experiments were running on the Red Hat Enterprise Linux release 8.6 with the CPU model of Intel(R) Xeon(R) Gold 6238R CPU @ 2.20GHz.

Notes: Some PCR-deduplication and error correction methods may yield slightly different deduplication and error correction results by running on different machines or different time.

Datasets

We take 8 UMI-based single-end sequencing datasets of the accession IDs SRR1543964 - SRR1543971 together with a pair-end sequencing dataset of the accession ID SRR11207257 for the comparative analysis.

Data pre-processing

For processing datasets SRR1543964 - SRR1543971, run the following scripts

$ python ./src/preprocessing_se.py

For processing pair-end datasets SRR11207257,

$ python ./src/preprocessing_pe.py

The pre-processed datasets can be found at OneDrive.

Comparative Analysis

Clone this repository first

git clone https://github.com/Jappy0/Deduplication_ErrorCorrection
cd Deduplication_ErrorCorrection

Deduplication

Install some deduplication and error-correction tools via Anaconda and pip

pip install git+https://github.com/pinellolab/AmpUMI.git
conda install bioconda::fastp
conda install bioconda::minirmd
conda install bioconda::cd-hit
conda install bioconda::calib
conda install bioconda::fastuniq
conda install bioconda::bcool

Then, run the following scripts

Notes: please remember to replace the input directories to your owns before running the following scripts.

python ./src/deduplication.py

Error-Correction

python ./src/error_correction.py

for error correction by bcool, you need to run the separate script

python ./src/bcool.py

Deduplication after Error-Correction

python ./src/deduplication_after_error_correction.py

Reference

Kendell Clement, Rick Farouni, Daniel E. Bauer, and Luca Pinello. AmpUMI: Design and analysis of unique molecular identifiers for deep amplicon sequencing. Bioinformatics (Oxford, England), 34(13):i202–i210, 2018.

Tom Smith, Andreas Heger, and Ian Sudbery. UMI-tools: Modeling sequencing errors in Unique Molecular Identifiers to improve quantification accuracy. Genome Research, 27(3):491–499, 2017.

Maria Tsagiopoulou, Maria Christina Maniou, Nikolaos Pechlivanis, Anastasis Togkousidis, Michaela Kotrová, Tobias Hutzenlaub, Ilias Kappas, Anastasia Chatzidimitriou, and Fotis Psomopoulos. UMIc: A Preprocessing Method for UMI Deduplication and Reads Correction. Frontiers in Genetics, 12:660366, 2021.

Antonio Sérgio Cruz Gaia, Pablo Henrique Caracciolo Gomes de Sá, Mônica Silva de Oliveira, and Adonney Allan de Oliveira Veras. NGSReadsTreatment – A Cuckoo Filter-based Tool for Removing Duplicate Reads in NGS Data. Scientific Reports, 9(1):11681, 2019.

Hang Dai and Yongtao Guan. Nubeam-dedup: A fast and RAMefficient tool to de-duplicate sequencing reads without mapping. Bioinformatics, 36(10):3254–3256, 2020.

Gianvito Urgese, Emanuele Parisi, Orazio Scicolone, Santa Di Cataldo, and Elisa Ficarra. BioSeqZip: A collapser of NGS redundant reads for the optimization of sequence analysis. Bioinformatics, 36(9):2705–2711, 2020.

Shifu Chen, Yanqing Zhou, Yaru Chen, and Jia Gu. Fastp: An ultra-fast all-in-one FASTQ preprocessor. Bioinformatics, 34(17):i884–i890, 2018.

Jason A. Vander Heiden, Gur Yaari, Mohamed Uduman, Joel N.H. Stern, Kevin C. O’Connor, David A. Hafler, Francois Vigneault, and Steven H. Kleinstein. pRESTO: A toolkit for processing high-throughput sequencing raw reads of lymphocyte receptor repertoires. Bioinformatics, 30(13):1930–1932, 2014.

Ying Huang, Beifang Niu, Ying Gao, Limin Fu, and Weizhong Li. CD-HIT Suite: A web server for clustering and comparing biological sequences. Bioinformatics (Oxford, England), 26(5):680–682, 2010.

Jorge González-Domínguez and Bertil Schmidt. ParDRe: Faster parallel duplicated reads removal tool for sequencing studies. Bioinformatics (Oxford, England), 32(10):1562–1564, 2016.

Yuansheng Liu, Xiaocai Zhang, Quan Zou, and Xiangxiang Zeng. Minirmd: accurate and fast duplicate removal tool for short reads via multiple minimizers. Bioinformatics, 37(11):1604– 1606, 2021.

Baraa Orabi, Emre Erhan, Brian McConeghy, Stanislav V Volik, Stephane Le Bihan, Robert Bell, Colin C Collins, Cedric Chauve, and Faraz Hach. Alignment-free clustering of umi tagged dna molecules. Bioinformatics, 35(11):1829–1836, 2019.

Haibin Xu, Xiang Luo, Jun Qian, Xiaohui Pang, Jingyuan Song, Guangrui Qian, Jinhui Chen, and Shilin Chen. FastUniq: A fast de novo duplicates removal tool for paired short reads. PloS One, 7(12):e52249, 2012.

Heng Li. BFC: Correcting Illumina sequencing errors. Bioinformatics (Oxford, England), 31(17):2885–2887, 2015.

Felix Kallenborn, Andreas Hildebrandt, and Bertil Schmidt. CARE: Context-aware sequencing read error correction. Bioinformatics, 37(7):889–895, 2021.

Felix Kallenborn, Julian Cascitti, and Bertil Schmidt. CARE 2.0: Reducing false-positive sequencing error corrections using machine learning. BMC Bioinformatics, 23(1):227, 2022.

Leena Salmela and Jan Schröder. Correcting errors in short reads by multiple alignments. Bioinformatics, 27(11):1455–1461, 2011.

Marcel H. Schulz, David Weese, Manuel Holtgrewe, Viktoria Dimitrova, Sijia Niu, Knut Reinert, and Hugues Richard. Fiona: A parallel and automatic strategy for read error correction. Bioinformatics, 30(17):i356–i363, 2014.

Li Song, Liliana Florea, and Ben Langmead. Lighter: Fast and memory-efficient sequencing error correction without counting. Genome Biology, 15(11):509, 2014.

Eric Marinier, Daniel G. Brown, and Brendan J. McConkey. Pollux: Platform independent error correction of single and mixed genomes. BMC Bioinformatics, 16(1):10, 2015.

Lucian Ilie and Michael Molnar. RACER: Rapid and accurate correction of errors in reads. Bioinformatics (Oxford, England), 29(19):2490–2493, 2013.

Antoine Limasset, Jean-François Flot, and Pierre Peterlongo.Toward perfect reads: Self-correction of short reads via mapping on de Bruijn graphs. Bioinformatics, 36(2):651, 2020.

Contact

Feel free to contact me ([email protected]) if you have any questions, comments, suggestions etc regarding this study.

deduplication_errorcorrection's People

Contributors

jappy0 avatar

Stargazers

 avatar

Watchers

 avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.