icgc-argo / seq-tools

Command line tools for ARGO sequencing data validation

Home Page: https://github.com/icgc-argo/seq-tools

License: GNU Affero General Public License v3.0


seq-tools's Introduction


Command line tools for sequencing data validation

Installation

For Ubuntu, make sure you have Python 3 installed first (tested on 3.7; other 3.x versions should work too), then follow these steps to install seq-tools (other operating systems are similar):

# install samtools (which is mainly used to retrieve BAM header information)
sudo apt install samtools

# installing jq is suggested for viewing JSON / JSONL in a human-friendly format
sudo apt install jq

# clone the repo
git clone https://github.com/icgc-argo/seq-tools.git

# install using pip
cd seq-tools
pip3 install -r requirements.txt  # install Python dependencies
pip3 install .

# to install a specific version (eg, 1.1.0) without cloning the git repository
pip install git+https://github.com/icgc-argo/seq-tools.git@1.1.0  # replace 1.1.0 with another released version as needed

# to uninstall
pip uninstall seq-tools

# verify the installation by checking the version
seq-tools -v

If you can run Docker and prefer to use it, there is no need to install seq-tools beforehand; see the example below for how to run seq-tools in a Docker container.

Try it out using testing data

Try it with the example submissions under tests/submissions (this assumes you have already cloned the repository).

cd tests/submissions
# validate the metadata JSON under 'HCC1160T.valid' directory, assuming data files are in the same directory
seq-tools validate HCC1160T.valid/sequencing_experiment.json   # you should see a summary of the validation result

# to view details of the above validation result
cat validation_report.PASS.jsonl | jq . | less

# use '-d' option if data files are located in a different directory than where the metadata file lives
seq-tools validate -d ../seq-data/ metadata_file_only/HCC1143T.WGS.meta.json

# to view details of the above validation result
cat validation_report.INVALID.jsonl | jq . | less

# or validate all metadata JSONs using wildcard in one go, assuming all data files are under '../seq-data/'
seq-tools validate -d ../seq-data/ */*.json   # as the summary indicates, three validation reports are generated

# view reported issues for INVALID metadata files
cat validation_report.INVALID.jsonl | jq . | less

# view details for PASS-with-WARNING metadata files
cat validation_report.PASS-with-WARNING.jsonl | jq . | less

# view details for PASS metadata files
cat validation_report.PASS.jsonl | jq . | less

# if you can run docker, here is how you may use it. Not suggested for users unfamiliar with Docker
cd ..  # make sure you are under the `tests` directory
docker pull quay.io/icgc-argo/seq-tools:1.0.1
alias seq-tools-in-docker="docker run -t -v `pwd`:`pwd` -w `pwd` quay.io/icgc-argo/seq-tools:1.0.1 seq-tools"
seq-tools-in-docker validate -d seq-data/ submissions/*/*.json  # you should see the same results as running seq-tools without docker

Use it to validate your own submissions

Similar to the example submissions under tests/submissions, you have two options for organizing your own submissions:

  1. either put each metadata JSON file and its related data files in their own folder; for example, tests/submissions/HCC1160T.valid contains sequencing_experiment.json and test_rg_6.bam, with no other submissions in it;

  2. or put multiple metadata files together in one folder with all related data files elsewhere; for example, tests/submissions/metadata_file_only contains two metadata files (HCC1143N.WGS.meta.json and HCC1143T.WGS.meta.json) whose data files are under tests/seq-data.

For the first option, you can launch validation by specifying the metadata file, for example: seq-tools validate tests/submissions/HCC1160T.valid/sequencing_experiment.json. Note that in this case there is no need to provide the -d parameter, since the data files are located in the same folder as the metadata file. For the second option, it is necessary to use -d to specify the directory where the data files are located, for example: seq-tools validate -d tests/seq-data tests/submissions/metadata_file_only/*.json

Testing

Continuous integration testing is enabled using GitHub Actions. Developers of validation checks can also run the tests manually:

pytest -v

seq-tools's People

Contributors: junjun-zhang, edsu7, lindaxiang

seq-tools's Issues

🐛 Failing c650 test due to empty SM

SM != "submitterSampleID"

Real data fails the following test:

  • c650_sm_in_bam_matches_metadata

Related warnings occur in:

  • c660_metadata_in_bam_rg_header

This is due to the command run during alignment setting SM to an empty string; the check then fails when the empty SM is compared against the submitterSampleId in the metadata.

@CO	user command line: STAR --genomeDir [redacted] --readFilesIn [redacted] [redacted] --runThreadN 4 --outFilterMultimapScoreRange 1 --outFilterMultimapNmax 20 --outFilterMismatchNmax 10 --alignIntronMax 500000 --alignMatesGapMax 1000000 --sjdbScore 2 --alignSJDBoverhangMin 1 --genomeLoad NoSharedMemory --limitBAMsortRAM 70000000000 --readFilesCommand cat --outFilterMatchNminOverLread 0.33 --outFilterScoreMinOverLread 0.33 --sjdbOverhang 100 --outSAMstrandField intronMotif --outSAMattributes NH HI NM MD AS XS --outSAMunmapped Within --outSAMtype BAM SortedByCoordinate --outSAMheaderHD @HD VN:1.4 --outSAMattrRGline ID:[redacted] "LB:[redacted]" CN:QCMG SM: PL:ILLUMINA "PM:Illumina HiSeq 2000" PU:[redacted] --outSAMheaderCommentFile [redacted]

Steps To Reproduce

Running based on:
https://github.com/icgc-argo/argo-meta/blob/paca-au_rna-seq/icgc_song_payloads/APGI-AU/RNA-Seq/trial_batch/009bebd8-3cf9-45a8-8546-1a09a8bf91a1.json

Affected line:

elif bams and list(all_sms)[0] != submitter_sample_id:

Example output

From .validation_report.INVALID.jsonl

{"tool": {"name": "seq-tools", "version": "1.1.0"}, "metadata_file": "/home/ubuntu/downloads/data-submission/009bebd8-3cf9-45a8-8546-1a09a8bf91a1/009bebd8-3cf9-45a8-8546-1a09a8bf91a1.json", "data_dir": "/home/ubuntu/downloads/data-submission/009bebd8-3cf9-45a8-8546-1a09a8bf91a1", "started_at": "2022-03-04T18:38:45.586Z", "ended_at": "2022-03-04T18:38:56.955Z", "validation": {"status": "INVALID", "message": "Please see individual checks for details", "checks": [{"checker": "c650_sm_in_bam_matches_metadata", "status": "INVALID", "message": "SM in BAM header does not match submitterSampleId in metadata JSON:  vs 8043985"}]}}

From log

2022-03-04 18:38:45,700 [MainThread  ] [INFO ] [c650_sm_in_bam_matches_metadata] SM in BAM header does not match submitterSampleId in metadata JSON:  vs 804398

Ideas for resolution

Add a PASS-with-WARNING status?
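One possible shape for that resolution, as a sketch only (the function name and result layout are assumptions for illustration, not seq-tools' actual checker code): when SM is empty, downgrade the failure to a warning and state that the metadata value will be used.

# Sketch: treat an empty SM as a WARNING rather than INVALID.
# The result dict shape and function name are illustrative assumptions.
def check_sm_matches_metadata(all_sms, submitter_sample_id):
    sms = sorted({sm for sm in all_sms if sm})  # ignore empty SM values
    if not sms:
        return {"status": "WARNING",
                "message": "SM in BAM header is empty or missing; "
                           "submitterSampleId from metadata will be used"}
    if sms[0] != submitter_sample_id:
        return {"status": "INVALID",
                "message": "SM in BAM header does not match "
                           "submitterSampleId in metadata JSON: "
                           f"{sms[0]} vs {submitter_sample_id}"}
    return {"status": "PASS", "message": "SM matches submitterSampleId"}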

🐛 Validation does not check empty string for required fields

Describe the bug

The platform_unit field is required in the sequencing_experiment schema, but if it is submitted as an empty string in the metadata JSON, validation does not report it as an error.

Steps To Reproduce

BAM header file: PU does not exist

@HD	VN:1.5	GO:none	SO:coordinate
@SQ	SN:chrM	LN:16569
@SQ	SN:chr1	LN:249250621
@SQ	SN:chr2	LN:243199373
@SQ	SN:chr3	LN:198022430
@SQ	SN:chr4	LN:191154276
@SQ	SN:chr5	LN:180915260
@SQ	SN:chr6	LN:171115067
@SQ	SN:chr7	LN:159138663
@SQ	SN:chr8	LN:146364022
@SQ	SN:chr9	LN:141213431
@SQ	SN:chr10	LN:135534747
@SQ	SN:chr11	LN:135006516
@SQ	SN:chr12	LN:133851895
@SQ	SN:chr13	LN:115169878
@SQ	SN:chr14	LN:107349540
@SQ	SN:chr15	LN:102531392
@SQ	SN:chr16	LN:90354753
@SQ	SN:chr17	LN:81195210
@SQ	SN:chr18	LN:78077248
@SQ	SN:chr19	LN:59128983
@SQ	SN:chr20	LN:63025520
@SQ	SN:chr21	LN:48129895
@SQ	SN:chr22	LN:51304566
@SQ	SN:chrX	LN:155270560
@SQ	SN:chrY	LN:59373566
@SQ	SN:chr1_gl000191_random	LN:106433
@SQ	SN:chr1_gl000192_random	LN:547496
@SQ	SN:chr4_gl000193_random	LN:189789
@SQ	SN:chr4_gl000194_random	LN:191469
@SQ	SN:chr7_gl000195_random	LN:182896
@SQ	SN:chr8_gl000196_random	LN:38914
@SQ	SN:chr8_gl000197_random	LN:37175
@SQ	SN:chr9_gl000198_random	LN:90085
@SQ	SN:chr9_gl000199_random	LN:169874
@SQ	SN:chr9_gl000200_random	LN:187035
@SQ	SN:chr9_gl000201_random	LN:36148
@SQ	SN:chr11_gl000202_random	LN:40103
@SQ	SN:chr17_gl000203_random	LN:37498
@SQ	SN:chr17_gl000204_random	LN:81310
@SQ	SN:chr17_gl000205_random	LN:174588
@SQ	SN:chr17_gl000206_random	LN:41001
@SQ	SN:chr18_gl000207_random	LN:4262
@SQ	SN:chr19_gl000208_random	LN:92689
@SQ	SN:chr19_gl000209_random	LN:159169
@SQ	SN:chr21_gl000210_random	LN:27682
@SQ	SN:chrUn_gl000211	LN:166566
@SQ	SN:chrUn_gl000212	LN:186858
@SQ	SN:chrUn_gl000213	LN:164239
@SQ	SN:chrUn_gl000214	LN:137718
@SQ	SN:chrUn_gl000215	LN:172545
@SQ	SN:chrUn_gl000216	LN:172294
@SQ	SN:chrUn_gl000217	LN:172149
@SQ	SN:chrUn_gl000218	LN:161147
@SQ	SN:chrUn_gl000219	LN:179198
@SQ	SN:chrUn_gl000220	LN:161802
@SQ	SN:chrUn_gl000221	LN:155397
@SQ	SN:chrUn_gl000222	LN:186861
@SQ	SN:chrUn_gl000223	LN:180455
@SQ	SN:chrUn_gl000224	LN:179693
@SQ	SN:chrUn_gl000225	LN:211173
@SQ	SN:chrUn_gl000226	LN:15008
@SQ	SN:chrUn_gl000227	LN:128374
@SQ	SN:chrUn_gl000228	LN:129120
@SQ	SN:chrUn_gl000229	LN:19913
@SQ	SN:chrUn_gl000230	LN:43691
@SQ	SN:chrUn_gl000231	LN:27386
@SQ	SN:chrUn_gl000232	LN:40652
@SQ	SN:chrUn_gl000233	LN:45941
@SQ	SN:chrUn_gl000234	LN:40531
@SQ	SN:chrUn_gl000235	LN:34474
@SQ	SN:chrUn_gl000236	LN:41934
@SQ	SN:chrUn_gl000237	LN:45867
@SQ	SN:chrUn_gl000238	LN:39939
@SQ	SN:chrUn_gl000239	LN:33824
@SQ	SN:chrUn_gl000240	LN:41933
@SQ	SN:chrUn_gl000241	LN:42152
@SQ	SN:chrUn_gl000242	LN:43523
@SQ	SN:chrUn_gl000243	LN:43341
@SQ	SN:chrUn_gl000244	LN:39929
@SQ	SN:chrUn_gl000245	LN:36651
@SQ	SN:chrUn_gl000246	LN:38154
@SQ	SN:chrUn_gl000247	LN:36422
@SQ	SN:chrUn_gl000248	LN:39786
@SQ	SN:chrUn_gl000249	LN:38502
@SQ	SN:hs37d5	LN:35477943
@SQ	SN:NC_007605	LN:171823
@RG	ID:LP6008269-DNA_H08	PL:ILLUMINA	LB:LP6008269-DNA_H08	SM:LP6008269-DNA_H08
@PG	ID:bwa	PN:bwa	VN:0.7.10-r789	CL:/home/mib-cri/software/bwa//bwa-0.7.10/bwa mem -R @RG\tID:LP6008269-DNA_H08\tSM:LP6008269-DNA_H08\tLB:LP6008269-DNA_H08\tPL:ILLUMINA -p /lustre/projects/stlab-icgc/software/ref_genomes/GRCh37_g1k//bwa-0.7.5a/Hsa.GRCh37.bwa -
@PG	ID:MarkDuplicates	PN:MarkDuplicates	VN:1.115(30b1e546cc4dd80c918e151dbfe46b061e63f315_1402927010)	CL:picard.sam.MarkDuplicates INPUT=[/lustre/projects/stlab-icgc/prod/batch18f_realign/alignment/LP6008269-DNA_H08/1.5/HFH7WALXX.5.lane.bam, /lustre/projects/stlab-icgc/prod/batch18f_realign/alignment/LP6008269-DNA_H08/1.5/HFH7WALXX.6.lane.bam] OUTPUT=/lustre/projects/stlab-icgc/prod/batch18f_realign/alignment/LP6008269-DNA_H08/1.5/LP6008269-DNA_H08.bam METRICS_FILE=/lustre/projects/stlab-icgc/prod/batch18f_realign/alignment/LP6008269-DNA_H08/1.5/LP6008269-DNA_H08.lane.dups.txt.pipetemp READ_NAME_REGEX=[a-zA-Z0-9_\-]+:[0-9]:([0-9]+):([0-9]+):([0-9]+).* TMP_DIR=[/lustre/projects/stlab-icgc/prod/batch18f_realign/alignment/LP6008269-DNA_H08/1.5/temp] VALIDATION_STRINGENCY=SILENT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=true    PROGRAM_RECORD_ID=MarkDuplicates PROGRAM_GROUP_NAME=MarkDuplicates REMOVE_DUPLICATES=false ASSUME_SORTED=false MAX_SEQUENCES_FOR_DISK_READ_ENDS_MAP=50000 MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=8000 SORTING_COLLECTION_SIZE_RATIO=0.25 OPTICAL_DUPLICATE_PIXEL_DISTANCE=100 VERBOSITY=INFO QUIET=false CREATE_MD5_FILE=false

Metadata JSON file: platform_unit is an empty string:

{
    "studyId": "OCCAMS-GB",
    "analysisType": {
        "name": "sequencing_experiment"
    },
    "samples": [
        {
            "submitterSampleId": "LP6008269-DNA_H08",
            "matchedNormalSubmitterSampleId": "LP6008268-DNA_H08",
            "sampleType": "Total DNA",
            "specimen": {
                "submitterSpecimenId": "83fcf464cf46317238331e5c7da26690e645be11aee45d49c6ed91da490e313f",
                "specimenType": "Primary tumour",
                "tumourNormalDesignation": "Tumour",
                "specimenTissueSource": "Solid tissue"
            },
            "donor": {
                "submitterDonorId": "b8bfd29057a30c787f3ce32ad6fd64aeb349043a52c25d917cb28644bc0a48a0",
                "gender": "Male"
            }
        }
    ],
    "files": [
        {
            "dataType": "Submitted Reads",
            "fileName": "563f0b6e132965027e6ec742752a4757.LP6008269-DNA_H08.bam",
            "fileSize": 152209724860,
            "fileType": "BAM",
            "fileMd5sum": "563f0b6e132965027e6ec742752a4757",
            "fileAccess": "controlled",
            "info": {
                "legacyAnalysisId": "EGAR00001566506",
                "data_category": "Sequencing Reads"
            }
        }
    ],
    "read_groups": [
        {
            "file_r1": "563f0b6e132965027e6ec742752a4757.LP6008269-DNA_H08.bam",
            "file_r2": "563f0b6e132965027e6ec742752a4757.LP6008269-DNA_H08.bam",
            "insert_size": null,
            "platform_unit": "",
            "is_paired_end": true,
            "library_name": "LP6008269-DNA_H08",
            "read_length_r1": null,
            "read_length_r2": null,
            "sample_barcode": null,
            "read_group_id_in_bam": null,
            "submitter_read_group_id": "LP6008269-DNA_H08"
        }
    ],
    "experiment": {
        "submitter_sequencing_experiment_id": "EXP-927",
        "experimental_strategy": "WGS",
        "sequencing_center": "",
        "platform": "ILLUMINA",
        "platform_model": null,
        "sequencing_date": null
    },
    "read_group_count": 1
}

Snippet from report.json (other checks were correctly validated):

{
        "checker": "c140_platform_unit_uniqueness",
        "status": "VALID",
        "message": "Platform unit uniqueness check status: VALID"
      },

Expected behaviour

Validation should recognize an empty string as a missing value for a required field and return INVALID
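A minimal sketch of the expected behaviour, assuming a simple required-field list (the field list and function name are illustrative, not seq-tools' actual code):

# Sketch: treat empty or whitespace-only strings as missing when
# validating required fields; the field list here is illustrative.
REQUIRED_READ_GROUP_FIELDS = ("submitter_read_group_id", "library_name",
                              "platform_unit", "is_paired_end")

def missing_required_fields(read_group: dict) -> list:
    missing = []
    for field in REQUIRED_READ_GROUP_FIELDS:
        value = read_group.get(field)
        if value is None or (isinstance(value, str) and not value.strip()):
            missing.append(field)
    return missing

Applied to the read group in the metadata above, this returns ['platform_unit'], which should make the overall status INVALID.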

New checker (c240): 'read_group_id_in_bam' and 'submitter_read_group_id' collision check

For the same BAM, when 'read_group_id_in_bam' is missing for one read group, that read group's 'submitter_read_group_id' can NOT be the same as the 'read_group_id_in_bam' of another read group (a sketch of this check follows the example below).

Here is an example of invalid read group metadata: "submitter_read_group_id": "CRUK-CI:LP6005499-DNA_D03" in the first read group collides with "read_group_id_in_bam": "CRUK-CI:LP6005499-DNA_D03" in the second read group:

{
  "read_groups": [
    {
      "file_r1": "17e6368cb4ae6ae3ac0ca12e20c90dd4.bam",
      "file_r2": "17e6368cb4ae6ae3ac0ca12e20c90dd4.bam",
      "insert_size": null,
      "library_name": "WGS:CRUK-CI:LP6005499-DNA_D03",
      "is_paired_end": true,
      "platform_unit": "CRUK-CI:LP6005499-DNA_D03_4",
      "read_length_r1": null,
      "read_length_r2": null,
      "sample_barcode": null,
      "read_group_id_in_bam": null,
      "submitter_read_group_id": "CRUK-CI:LP6005499-DNA_D03"
    },
    {
      "file_r1": "17e6368cb4ae6ae3ac0ca12e20c90dd4.bam",
      "file_r2": "17e6368cb4ae6ae3ac0ca12e20c90dd4.bam",
      "insert_size": null,
      "library_name": "WGS:CRUK-CI:LP6005499-DNA_D03",
      "is_paired_end": true,
      "platform_unit": "CRUK-CI:LP6005499-DNA_D03_5",
      "read_length_r1": null,
      "read_length_r2": null,
      "sample_barcode": null,
      "read_group_id_in_bam": "CRUK-CI:LP6005499-DNA_D03",
      "submitter_read_group_id": "CRUK-CI:LP6005499-DNA_D03_1"
    },
    {
      "file_r1": "17e6368cb4ae6ae3ac0ca12e20c90dd4.bam",
      "file_r2": "17e6368cb4ae6ae3ac0ca12e20c90dd4.bam",
      "insert_size": null,
      "library_name": "WGS:CRUK-CI:LP6005499-DNA_D03",
      "is_paired_end": true,
      "platform_unit": "CRUK-CI:LP6005499-DNA_D03_6",
      "read_length_r1": null,
      "read_length_r2": null,
      "sample_barcode": null,
      "read_group_id_in_bam": "CRUK-CI:LP6005499-DNA_D03''",
      "submitter_read_group_id": "CRUK-CI:LP6005499-DNA_D03_2"
    }
  ]
}
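A minimal sketch of the proposed check, assuming all listed read groups belong to the same BAM (grouping by file is omitted for brevity):

# Sketch of the proposed c240 collision check; assumes the read groups
# all belong to the same BAM (grouping by file_r1 omitted for brevity).
def find_rg_id_collisions(read_groups):
    ids_in_bam = {rg["read_group_id_in_bam"] for rg in read_groups
                  if rg.get("read_group_id_in_bam")}
    return [rg["submitter_read_group_id"] for rg in read_groups
            if not rg.get("read_group_id_in_bam")
            and rg["submitter_read_group_id"] in ids_in_bam]

For the example above, this returns ['CRUK-CI:LP6005499-DNA_D03']: the first read group's effective BAM RG ID collides with the second read group's read_group_id_in_bam.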

Deal with non-submission looking directory gracefully

We need to verify some essential elements of a user-supplied submission directory. If it does not look like a properly structured submission directory, abort and inform the user with an informative message; there is no point in continuing down the normal path.
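A sketch of such an early sanity check, assuming a layout where each submission directory contains exactly one metadata JSON (the exact rules may differ):

import glob
import os
import sys

# Sketch: abort early with an informative message when the supplied
# path does not look like a submission directory. The "exactly one
# metadata JSON" rule is an assumption about the expected layout.
def ensure_submission_dir(path):
    if not os.path.isdir(path):
        sys.exit(f"Error: '{path}' is not a directory")
    metadata_files = glob.glob(os.path.join(path, "*.json"))
    if len(metadata_files) != 1:
        sys.exit(f"Error: '{path}' does not look like a submission "
                 f"directory; expected exactly one metadata JSON file, "
                 f"found {len(metadata_files)}")
    return metadata_files[0]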

Add README.md for Validation Checks in the Seq-Tools

With the increasing number of validation checks being added to seq-tools, we should add a README.md that includes descriptions of all the validation checks.

Maybe under the folder: https://github.com/icgc-argo/seq-tools/tree/main/seq_tools/validation

New checker (c660): other RG fields in BAM header

Excluding ID and SM, all other @RG fields in the BAM header are optional; however, if any of these fields is provided in the BAM, it should match what is provided in the SONG metadata. Here are the mappings between @RG fields and the corresponding fields in SONG metadata:

  • ID=submitter_read_group_id (or read_group_id_in_bam if populated)
  • SM=submitterSampleId
  • BC=sample_barcode
  • CN=sequencing_center
  • DT=sequencing_date
  • LB=library_name
  • PI=insert_size
  • PL=platform
  • PM=platform_model
  • PU=platform_unit

If a field does not match, set the validation status to WARNING and state clearly that the information in the metadata (not that of the submitted BAM) will be used in the header of ARGO uniformly aligned sequences.
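A sketch of the comparison, assuming the per-read-group and experiment-level metadata values have been merged into one flat dict keyed by the field names above (the merging step is omitted, and the function name is an assumption):

# Sketch of the proposed c660 check: compare optional @RG fields in the
# BAM header against SONG metadata; mismatches become WARNINGs. Assumes
# experiment-level fields (CN, DT, PL, PM) were merged into one dict.
RG_TO_METADATA = {
    "BC": "sample_barcode", "CN": "sequencing_center",
    "DT": "sequencing_date", "LB": "library_name",
    "PI": "insert_size",     "PL": "platform",
    "PM": "platform_model",  "PU": "platform_unit",
}

def rg_field_warnings(rg_in_bam: dict, metadata: dict) -> list:
    warnings = []
    for tag, field in RG_TO_METADATA.items():
        bam_value, meta_value = rg_in_bam.get(tag), metadata.get(field)
        if bam_value and meta_value is not None \
                and str(bam_value) != str(meta_value):
            warnings.append(
                f"@RG {tag}={bam_value} differs from metadata {field}="
                f"{meta_value}; metadata will be used in the header of "
                f"ARGO uniformly aligned sequences")
    return warnings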

Detect `RNA-seq strandedness` in FASTQ

We need to run an alignment on our FASTQ files, or a subset of them, to determine strandedness, and add this check to our suite of seq-tools checks.

FASTQ sanity check

Currently the seq-tools validator does not actually check that a submitted file adheres to the FASTQ format rather than being just a random text file.

So we need to add this check to our suite of seq-tools checks.
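A minimal sketch of such a check, verifying the four-line record structure on the first records of a file (the record limit and gzip handling are assumptions, not existing seq-tools behaviour):

import gzip

# Sketch: verify the four-line FASTQ record structure for the first
# few records; the record limit and gzip handling are assumptions.
def looks_like_fastq(path, max_records=100):
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as f:
        for i in range(max_records):
            record = [f.readline().rstrip("\n") for _ in range(4)]
            if not record[0]:                    # end of file
                return i > 0                     # need at least 1 record
            header, seq, sep, qual = record
            if not header.startswith("@") or not sep.startswith("+"):
                return False
            if len(seq) != len(qual) or not seq:
                return False
    return True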

Feature Request: QA from seq-tools

  1. INVALID should actually mean an invalid payload that we do not want submitted. Currently, the c670_coverage_estimate check is run and assigned a value of PASS or INVALID, but it is actually optional: it can fail, yet the user can still submit the data based on their own judgement of it. This is more of an FYI or warning status; if submitters have more data they can add it, but they may choose to submit as-is.
  • Remove this check. I discussed adding it as a warning on Slack, but in discussion with Christina it was decided it is better to just remove it until we know more.
  2. The report is long and adds unactionable (thus unnecessary) information, making it harder to read when debugging. Overall, the experience could better help me identify the actionable items that would make my payloads valid: any time a validation fails, I have to read through everything that passes to get to the failures.
  • To make the report more actionable, it should only present the failures; the INVALID and PASS results could be separated into two reports.
  • Failures are in individual reports, meaning I have to go through many folders to get to the issues. It would be nice to see one report of all failures so I could work on many at once (see the aggregation sketch at the end of this issue). For example:
{"summary": 
"INVALID": 4, 
{"submission_dir":  "/path/to/payload" "Status": "INVALID",  "failed_checkpoints": {ONLY TEH FAILED ISSUES}}
{"submission_dir":  "/path/to/payload" "Status": "INVALID",  "failed_checkpoints": {ONLY TEH FAILED ONES}}
{"submission_dir":  "/path/to/payload" "Status": "INVALID",  "failed_checkpoints": {ONLY TEH FAILED ONES}}
{"submission_dir":  "/path/to/payload" "Status": "INVALID",  "failed_checkpoints": {ONLY TEH FAILED ONES}}
}

This way I can fix all of the issues without going into individual folders for each one, and rerun validation in bulk.

  3. The tool is too opinionated about the formatting of input data: it is not easily scriptable because you have to conform to the expected directory structure. I had to rename all my payloads as I moved them into place, which was annoying and did not add value. I had named them to align with my data (similar to how Hardeep names the import payloads with the EGA id) so that the names were informative to me. The restriction also introduces room for error: when all the files look the same, working through many failing payloads has a large potential to be confusing and hard to follow.
  • It would be easier if the command were: seq-tools validate /path/to/payload/filename /directory/to/files
  • Remove the restriction that the payload has to be named sequencing_experiment.json.
  4. A stray file in the submissions directory (here a leftover report.json matched by the glob) causes a confusing error:
rbajari@wl5503-rbajar:~/DASH-CA_transferred_data/testing-payloads$ seq-tools validate *.*
Traceback (most recent call last):
  File "/home/rbajari/.local/bin/seq-tools", line 11, in <module>
    sys.exit(main())
  File "/home/rbajari/.local/lib/python3.6/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/home/rbajari/.local/lib/python3.6/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/home/rbajari/.local/lib/python3.6/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/rbajari/.local/lib/python3.6/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/rbajari/.local/lib/python3.6/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/home/rbajari/.local/lib/python3.6/site-packages/click/decorators.py", line 21, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/home/rbajari/.local/lib/python3.6/site-packages/seq_tools/cli/__init__.py", line 89, in validate
    perform_validation(ctx, subdir=subdir)
  File "/home/rbajari/.local/lib/python3.6/site-packages/seq_tools/validation/__init__.py", line 55, in perform_validation
    initialize_log(ctx, subdir)
  File "/home/rbajari/.local/lib/python3.6/site-packages/seq_tools/utils.py", line 48, in initialize_log
    os.mkdir(log_directory)
NotADirectoryError: [Errno 20] Not a directory: 'report.json/logs'

Optional Changes:

  1. On the initial output, some color coding would be nice; for example, highlighting INVALID results in red.


  2. Some Python 2 vs Python 3 issues exist in the instructions; for example, I had to use pip3 rather than pip to install. This is obviously system dependent. It would be better if we could package the tool with its dependencies so that the user does not have to do this extra installation work, maybe as a Docker image or with PyInstaller.
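Regarding point 2 above, a merged failure-only report could be derived from the existing per-submission JSONL reports. A sketch, assuming the report structure shown in the example output earlier on this page (the glob pattern is also an assumption):

import glob
import json

# Sketch: aggregate all per-submission reports into one failure-only
# summary. Field names follow the example report shown in the c650
# issue above; the default glob pattern is an assumption.
def failed_checks_summary(pattern="**/validation_report.INVALID.jsonl"):
    summary = []
    for path in glob.glob(pattern, recursive=True):
        with open(path) as f:
            for line in f:
                report = json.loads(line)
                failed = [c for c in report["validation"]["checks"]
                          if c["status"] == "INVALID"]
                if failed:
                    summary.append({
                        "metadata_file": report["metadata_file"],
                        "status": report["validation"]["status"],
                        "failed_checks": failed,
                    })
    return summary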

New feature: add a flag to skip md5sum check

For WGS data, the md5sum check takes a long time to complete; since all other checks are very fast, the md5sum check may account for over 90% of the execution time. To allow much quicker iteration of metadata checking and fixing, the user should be able to temporarily skip the md5sum check until all other checks pass.

As the c680 check includes both fileSize and fileMd5sum checks, it would be good to split it into two separate checks: c681_fileSize_match and c683_fileMd5sum_match.
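A sketch of how a skip flag could gate the slow md5sum part after the split (the flag name and result shape are assumptions, not an existing option):

import hashlib

# Sketch of a split c683 md5sum check gated by a skip flag; the flag
# name and result dict shape are assumptions, not an existing option.
def check_file_md5(path, expected_md5, skip_md5=False):
    if skip_md5:
        return {"checker": "c683_fileMd5sum_match", "status": "SKIPPED",
                "message": "md5sum check skipped at user's request"}
    md5 = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1024 * 1024), b""):
            md5.update(chunk)
    status = "VALID" if md5.hexdigest() == expected_md5 else "INVALID"
    return {"checker": "c683_fileMd5sum_match", "status": status,
            "message": f"computed {md5.hexdigest()}, expected {expected_md5}"}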
