ngless-toolkit / ngless

NGLess: NGS with less work

Home Page: https://ngless.embl.de

License: Other

Languages: Haskell 87.39%, C 0.21%, Makefile 0.54%, Shell 1.41%, CSS 0.63%, HTML 1.53%, Python 0.63%, Nix 7.41%, C++ 0.26%
Topics: haskell, bioinformatics, bioinformatics-pipeline, samtools, bwa, next-generation-sequencing, fastq-format, fastq, science, ngs, metagenomics, genomics, haskell-language

ngless's Introduction

NGLess: NGS Processing with Less Work

NGLess is a domain-specific language for NGS (next-generation sequencing) data processing.

MIT licensed.

For questions and discussions, please use the NGLess mailing list.

If you are using NGLess, please cite:

NG-meta-profiler: fast processing of metagenomes using NGLess, a domain-specific language by Luis Pedro Coelho, Renato Alves, Paulo Monteiro, Jaime Huerta-Cepas, Ana Teresa Freitas, Peer Bork, Microbiome (2019) https://doi.org/10.1186/s40168-019-0684-8

Example

ngless "1.5"
input = fastq(['ctrl1.fq','ctrl2.fq','stim1.fq','stim2.fq'])
input = preprocess(input) using |read|:
    read = read[5:]
    read = substrim(read, min_quality=26)
    if len(read) < 31:
        discard

mapped = map(input,
                reference='hg19')
write(count(mapped, features=['gene']),
        ofile='gene_counts.csv',
        format={csv})

For more information, check the docs. We also have a YouTube tutorial on how to use NGLess and SemiBin together (but you can learn to use NGLess independently of SemiBin).

Installing

See the install documentation for more information.

Bioconda

The recommended way to install NGLess is through bioconda:

conda install -c bioconda ngless 

Docker

Alternatively, a docker container with NGLess is available at docker hub:

docker run -v $PWD:/workdir -w /workdir -it nglesstoolkit/ngless:1.5.0 ngless --version

Adapt the mount flags (-v) as needed.

Linux

You can download a statically linked version of NGLess 1.5.0. It should work across a wide range of Linux versions (please report any issues you encounter):

curl -L -O https://github.com/ngless-toolkit/ngless/releases/download/v1.5.0/NGLess-v1.5.0-Linux-static-full
chmod +x NGLess-v1.5.0-Linux-static-full
./NGLess-v1.5.0-Linux-static-full

This downloaded file bundles bwa, samtools and megahit (also statically linked).

From Source

Installing/compiling from source is also possible. Clone https://github.com/ngless-toolkit/ngless

Dependencies

The simplest way to get an environment with all the dependencies is to use conda:

conda create -n ngless
conda activate ngless
conda config --add channels conda-forge
conda install stack cairo bzip2 gmp zlib perl wget xz pkg-config make

You also need gcc (or another C compiler) installed.

The following sequence of commands should download and build the software:

git clone https://github.com/ngless-toolkit/ngless
cd ngless
stack setup
make

To install, you can use the following command (replace <PREFIX> with the directory where you wish to install, default is /usr/local):

make install prefix=<PREFIX>

Running Sample Test Scripts on Local Machine

Developers who have successfully compiled and installed NGLess can next run the test scripts in the tests folder to see the output of sample test cases.

cd tests

Once in the tests directory, select any of the test folders to run NGLess.

For example, here we would run the regression-fqgz test:

cd regression-fqgz
ngless ungzip.ngl

After running this script open the newly generated folder ungzip.ngl.output_ngless and view the template in the index.html file.

Developers who have got this far can find more datasets for testing in these documentation pages:

  • Human Gut Metagenomics Functional & Taxonomic Profiling
  • Ocean Metagenomics Functional Profiling
  • Ocean Metagenomics Assembly and Gene Prediction

More information

Authors

ngless's People

Contributors

gitter-badger, luispedro, mkuhn, montoias, nairsajjal, psj1997, unode, vedanth-ramji

ngless's Issues

Document <references>

The special value <references> can be used to refer to a common location for resources.
It's convenient to use with map(fafile="<references>/myfasta.fna", ...).
Its value comes from calling ngless with --search-dir.

Multiple `-1` with count() when multiple features/subfeatures are requested

This is related to #34.

The special case of -1 is currently written several times to the output.

With multiple subfeatures, there's an additional bug in that the -1 line doesn't include the 2 identifying prefixes:

	reads.fq.gz
-1	4
gene	gene_id	feature_A	1
gene	gene_id	feature_B	2
gene	gene_id	feature_C	2
gene	gene_id	feature_D	1
-1	4
gene	gene_name	featA	1
gene	gene_name	featB	2
gene	gene_name	featC	2
gene	gene_name	featD	1

GFF subfeature counting should be expanded

The GFF3 format specification allows an attribute to have multiple values, separated by commas.
From the official docs:

Parent=AF2312,AB2812,abc-3

With the current version of NGLess a GFF file:

##gff-version 3
reference	protein_coding	gene	40	100	.	+	.	gene_id=geneA;gene_name=featA1,featA2
reference	protein_coding	gene	110	130	.	+	.	gene_id=geneB;gene_name=featA1
reference	protein_coding	gene	140	200	.	+	.	gene_id=geneC;gene_name=featA2

and a script:

    ngless "0.7"

    input = fastq('reads.fq.gz')
    mapped = map(input, fafile='ref.fna.gz')

    union = count(mapped,
                  gff_file='features.gff',
                  features=['gene'],
                  subfeatures=['gene_name'],
                  mode={union})
    write(union, ofile='output.txt')

produces:

	reads.fq.gz
-1	4
featA1	0
featA1,featA2	4
featA2	1

However, the expectation is that comma-separated values are expanded into one row per value.
Additionally, expansion should take into account the mode= and multiple= arguments of count():

	reads.fq.gz
-1	4
featA1	4
featA2	5
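
The expected table can be reproduced from the current output with a small post-processing step. The following Python sketch is only an illustration of the expansion, not how NGLess implements count(): the function name expand_counts is hypothetical, the -1 row is left out, and the mode= and multiple= arguments are ignored.

```python
from collections import defaultdict

def expand_counts(counts):
    """Split comma-separated row names and add the row's count to each part."""
    expanded = defaultdict(int)
    for name, value in counts.items():
        for part in name.split(","):
            expanded[part] += value
    return dict(expanded)

# Rows taken from the output currently produced (without the -1 row).
observed = {"featA1": 0, "featA1,featA2": 4, "featA2": 1}
print(expand_counts(observed))  # {'featA1': 4, 'featA2': 5}
```

This simple summing corresponds to counting a read once for every value it matches; other multiple= settings would need a different rule.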

Document `assemble()`

The docs referencing orf_find() mention assemble() which isn't mentioned anywhere else.

map-split: use relative instead of absolute path for symlinks

As part of the split-map strategy implemented in ngless 0.6, symlinks are created to avoid copying split files. As of this version, symlinks use absolute paths.
While this works on a single machine, it may break on shared filesystems (different mountpoints) or if the index folder needs to be manually relocated.

Making symlinks relative to the root of the index folder addresses the issues mentioned above.
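
For illustration, here is a minimal Python sketch of creating a relative symlink. The helper name relative_symlink and the blocks/ and links/ folder names are made up for the example and say nothing about the real index layout. Note that a symlink target is resolved relative to the directory that contains the link, so that is the starting point used to compute the relative path.

```python
import os
import tempfile

def relative_symlink(target, link_path):
    """Create link_path pointing at target through a relative path."""
    link_dir = os.path.dirname(os.path.abspath(link_path))
    rel_target = os.path.relpath(os.path.abspath(target), start=link_dir)
    os.symlink(rel_target, link_path)
    return rel_target

# Hypothetical index folder with a split file and a link to it.
with tempfile.TemporaryDirectory() as index_root:
    os.makedirs(os.path.join(index_root, "blocks"))
    os.makedirs(os.path.join(index_root, "links"))
    target = os.path.join(index_root, "blocks", "part0.fna")
    open(target, "w").close()
    link = os.path.join(index_root, "links", "part0.fna")
    print(relative_symlink(target, link))
```

Because the stored target contains no absolute prefix, the whole index folder can be moved or mounted elsewhere and the link still resolves.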

Clarify documentation of `mode` in `count()` to be more user friendly

count( mode = ... )'s documentation is unclear and perhaps a little low-level. It explains how the options relate to different sets but not how these sets relate to the read or the features being counted.

For example, if a read maps to a region with two non-overlapping features, and the option mode={intersection-nonempty} is used, does it mean that the read is not considered?

Perhaps some illustration will be clearer. ASCII-art to the rescue:

Reference  ###################################

Feature_A       ========
Feature_B            ==========
Feature_C                  ============

Read1              -----
Read2               -----
Read3                  -----
Read4                         -----
Read5                               -----
Read6                                   -----

  • Read1 is contained in Feature_A and partially overlaps Feature_B.
  • Read2 is not contained in any feature but partially overlaps Feature_A and Feature_B.
  • Read3 is contained in Feature_B and partially overlaps Feature_A and Feature_C.
  • Read4 is contained in Feature_C.
  • Read5 partially overlaps Feature_C.
  • Read6 doesn't overlap any feature.

And the following, perhaps in the form of a table.

With:

  • mode={union}

    • Read1 is counted for Feature_A...
    • Read2 is counted for Feature_A...
    • Read3 is counted for Feature_B...
    • Read4 is counted for Feature_C...
    • Read5 is counted ...
    • Read6 is never counted
  • mode={intersection-strict}

    • Read1 is ...
    • Read2 is ...
    • Read3 is ...
    • Read4 is ...
    • Read5 is ...
    • Read6 is never counted
  • mode={intersection-nonempty}

    • Read1 is ...
    • Read2 is ...
    • Read3 is ...
    • Read4 is ...
    • Read5 is ...
    • Read6 is never counted
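
One way to fill in the blanks is to compute them. The Python sketch below assumes the htseq-count style definitions, where a set of features is collected for every position of the read and the sets are then combined by union, by intersection, or by intersection of the non-empty sets. Whether NGLess follows exactly these definitions is the open question of this issue, so treat the result as a hypothesis. The coordinates are invented to match the bullet descriptions above (half-open intervals); they are not measured from the ASCII art.

```python
# Hypothetical coordinates chosen to match the read descriptions above.
FEATURES = {"Feature_A": (16, 24), "Feature_B": (21, 31), "Feature_C": (27, 39)}
READS = {"Read1": (19, 24), "Read2": (20, 25), "Read3": (23, 28),
         "Read4": (31, 36), "Read5": (36, 41), "Read6": (40, 45)}
MODES = ("union", "intersection-strict", "intersection-nonempty")

def features_at(pos, features):
    """Names of all features covering a single reference position."""
    return {name for name, (start, end) in features.items() if start <= pos < end}

def assign(read, features, mode):
    """Return the set of features a read is assigned to under the given mode."""
    start, end = read
    per_position = [features_at(p, features) for p in range(start, end)]
    if mode == "union":
        return set().union(*per_position)
    if mode == "intersection-strict":
        return set.intersection(*per_position)
    if mode == "intersection-nonempty":
        non_empty = [s for s in per_position if s]
        return set.intersection(*non_empty) if non_empty else set()
    raise ValueError(f"unknown mode: {mode}")

for name, read in READS.items():
    print(name, {mode: sorted(assign(read, FEATURES, mode)) for mode in MODES})
```

Under these assumptions, Read5 is assigned to Feature_C by union and intersection-nonempty but to nothing by intersection-strict, and Read2 gets both Feature_A and Feature_B by union but nothing by either intersection mode. What happens when a read ends up with more than one feature is a separate matter, handled by the multiple= argument.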

Add interleaved FastQ support

The format is not formally described but is used in the wild. A quick search turned up no mention of how 'singles' are handled. Possibilities include:

  1. Output .1 followed by .2 and add singles at the end of the file
  2. Tolerate unpaired singles in the middle of the file.

The second variant is more versatile (e.g. for filter()) as it doesn't require a second file to hold reads as they are being processed.
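
A rough Python sketch of the second variant follows. It assumes mates are named with a trailing /1 and /2, which is only one of several conventions in use, and that every record spans exactly four lines. The function names are made up for the example.

```python
import io

def read_fastq(handle):
    """Yield (name, sequence, quality) tuples from a 4-line-per-record FastQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return
        seq = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line
        qual = handle.readline().rstrip("\n")
        yield header[1:], seq, qual

def base_name(name):
    """Strip a trailing /1 or /2 mate suffix (assumed naming convention)."""
    return name[:-2] if name.endswith(("/1", "/2")) else name

def group_interleaved(records):
    """Yield ('pair', r1, r2) or ('single', r), tolerating singles anywhere."""
    pending = None
    for rec in records:
        is_mate = (pending is not None
                   and pending[0].endswith("/1") and rec[0].endswith("/2")
                   and base_name(pending[0]) == base_name(rec[0]))
        if is_mate:
            yield ("pair", pending, rec)
            pending = None
        else:
            if pending is not None:
                yield ("single", pending)
            pending = rec
    if pending is not None:
        yield ("single", pending)

DATA = ("@r1/1\nACGT\n+\nIIII\n@r1/2\nTTTT\n+\nIIII\n"
        "@r2/1\nGGGG\n+\nIIII\n"
        "@r3/1\nCCCC\n+\nIIII\n@r3/2\nAAAA\n+\nIIII\n")
groups = list(group_interleaved(read_fastq(io.StringIO(DATA))))
print([kind for kind, *_ in groups])  # ['pair', 'single', 'pair']
```

Only one record is held back at a time, so the stream can be processed in a single pass without a second file.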

External module and parser limitations

arg1 fails if of atype: 'str'

Using Modules/test.ngm/0.1/module.yaml:

version: '0.1'
name: 'test'
functions:
    -
        nglName: "test"
        arg0: './dummy.sh'
        arg1:
            atype: 'str'
        return:
            rtype: 'void'
            name: 'output'
            extension: 'void'

and test.ngl:

ngless "0.6"
local import "test" version "0.1"

sample = "test"
test(sample)

results in:

Exiting after fatal error while loading and running script
Should Not Occur Error! This probably indicates a bug in ngless.
        Please get in touch with the authors with a description of how this happened.
        If possible run your script with the --trace flag and post the script and the resulting trace at 
                http://github.com/luispedro/ngless/issues
        or email us at [email protected].
AsFile path got NGOString "test"

return: rtype: void requires name and extension.

Despite using rtype = void, NGLess complained when I didn't include a name and an extension argument as part of return:

Could not load module file ./Modules/test.ngm/0.1/module.yaml. Error was `Error in $.functions[0].return: key "name" not present`
Could not load module file ./Modules/test.ngm/0.1/module.yaml. Error was `Error in $.functions[0].return: key "extension" not present`

Cannot define function with named arguments only

Trying to omit arg1 results in a parsing error.

This is visible both in the case of:

local import "test" version "0.1"
test()

as well as in:

local import "test" version "0.1"
sample = "test"
test(name=sample)

resulting in:

unexpected TOperator ')'
expecting len (reserved word), operator -, not (reserved word), operator (, function call, operator [ or variable

or

unexpected TOperator '='
expecting binary operator, keyword argument list or operator )

Explicit way of invalidating locks

Sometimes when I'm testing things, jobs fail and leave active locks behind.
Subsequent calls fail as they cannot obtain any lock.

Currently I work around this by removing the ngless-locks folder, or the matching subfolder inside ngless-locks, which is suboptimal.

With that said, it would be nice if after:

ngless --options ... mycustomscript.ngl

one could:

ngless --clear-locks mycustomscript.ngl

Decrease memory usage of count() with seqname

Currently, the code uses a sorted Vector InfoRef where InfoRef is

data InfoRef = InfoRef {-# UNPACK #-} !ShortByteString {-# UNPACK #-} !Double

but this has a lot of memory overhead compared to something like a C++ std::vector<std::pair<const char*, double>> where the string data is packed together.

This could easily be done as a generic Haskell library independent of any NGLess code.
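
To make the packed layout concrete, here is a small Python sketch of the idea: all names live in one contiguous byte buffer, with a flat array of offsets and a flat array of doubles alongside. It only illustrates the data structure, it is not a proposal for the Haskell API, and the class name PackedCounts is made up.

```python
from array import array

class PackedCounts:
    """Sorted names in one bytes buffer, plus flat arrays of offsets and doubles."""

    def __init__(self, items):
        items = sorted((name.encode(), value) for name, value in items)
        self.offsets = array("Q", [0])  # start of each name, plus an end sentinel
        self.values = array("d")        # one double per name
        buf = bytearray()
        for name, value in items:
            buf += name
            self.offsets.append(len(buf))
            self.values.append(value)
        self.names = bytes(buf)

    def __len__(self):
        return len(self.values)

    def name_at(self, i):
        """Slice the i-th name out of the shared buffer."""
        return self.names[self.offsets[i]:self.offsets[i + 1]]

    def lookup(self, name):
        """Binary search by name; returns the stored value or None."""
        key = name.encode()
        lo, hi = 0, len(self)
        while lo < hi:
            mid = (lo + hi) // 2
            if self.name_at(mid) < key:
                lo = mid + 1
            else:
                hi = mid
        if lo < len(self) and self.name_at(lo) == key:
            return self.values[lo]
        return None

table = PackedCounts([("geneB", 2.0), ("geneA", 1.5)])
print(table.lookup("geneA"))  # 1.5
```

The per-entry cost is the name bytes plus one offset and one double, with no per-string object header.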

Add search path for references

For example:

map(input, fafile="<>/catalog/file.fna")

and ngless --path "/usr/share/...:/opt/share/...:..." would look into all the given directories.
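
The lookup could behave roughly like the Python sketch below. The function name resolve_reference is hypothetical, and "first existing match wins" is an assumption about the desired semantics, not something decided in this issue.

```python
import os
import tempfile

def resolve_reference(path, search_path, marker="<>"):
    """Return the first existing expansion of `path` over the search directories."""
    if not path.startswith(marker):
        return path  # ordinary paths are left untouched
    suffix = path[len(marker):].lstrip("/")
    for directory in search_path.split(":"):
        if not directory:
            continue
        candidate = os.path.join(directory, suffix)
        if os.path.exists(candidate):
            return candidate
    raise FileNotFoundError(f"{path} not found in any of: {search_path}")

# Two throwaway directories stand in for the real search path entries.
with tempfile.TemporaryDirectory() as first, tempfile.TemporaryDirectory() as second:
    os.makedirs(os.path.join(second, "catalog"))
    open(os.path.join(second, "catalog", "file.fna"), "w").close()
    found = resolve_reference("<>/catalog/file.fna", first + ":" + second)
    print(found == os.path.join(second, "catalog", "file.fna"))  # True
```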

Argument not checked prior to execution

According to the FAQ, ngless is supposed to check all input files prior to execution.

In the case below it doesn't check if the fafile or the folder containing it exists before executing.
Instead it executes up to that point and then fails when it can't create indexes.

ngless "0.0"
import "parallel" version "0.0"
import "samtools" version "0.0"
import "mocat" version "0.0"

TMPDIR = ARGV[2]
DB = TMPDIR + '/db.fna'
DATADIR = 'data/'

sample = ARGV[1]
input = load_mocat_sample(DATADIR + sample)

preprocess(input, keep_singles=True) using |read|:
    read = substrim(read, min_quality=25)
    if len(read) < 45:
        discard

hits = map(input, fafile=DB)

write(hits, ofile='outputs/' + sample + '_db.bam')

Called with ngless --trace map.ngl sample /tmp/non_existing_dir

CWL tool descriptions are missing outputs, therefore not usable in CWL workflows

Hey @luispedro , thanks for using the argparse2tool to generate CWL descriptions in https://github.com/luispedro/ngless/tree/master/scripts

However there are no output stanzas as argparse models the inputs to a program, not the outputs.

While the CWL ngless descriptions can be used to run a ngless command standalone, they can not be used in CWL workflows as stated in https://ngless.readthedocs.io/en/latest/faq.html#what-is-the-relationship-of-ngless-to-the-common-workflow-language due to the lack of outputs

At https://github.com/erasche/argparse2tool#cwl-specific-functionality we document how to pass in hand-written outputs stanzas using --output_section when invoking argparse2tool.

Alternatively I see some JSON-esque code at https://github.com/luispedro/ngless/blob/master/scripts/ngless-count.py#L35

You can pass the CWL input object directly as JSON to any tool and skip all the argument parsing complexity. Likewise any CWL compliant platform is able to consume JSON from a tool to learn at run time the actual outputs and their locations + any optional metadata.

I think DSLs are pretty cool and useful and I'd love to see more that compile or convert to CWL giving everyone the best of all worlds!

> make test

make test does not generate an executable in dist/, and the instruction to copy the executable to the directory root gives an error.

Bundled megahit is missing a version tag

Once extracted (via ngless --print-path megahit), none of the bundled files includes a reference to the NGLess version in its name.

Since NGLess currently only checks if the files exist, future releases should version tag all files.
The use-case of having different NGLess releases running in the same environment should be considered.

Update to use ghc from the edge testing repository for alpine linux

Opening this issue as an alert to anyone on github that appears to be using my ghc port.

Ghc and cabal are now upstreamed for x86_64 on alpine linux edge. Ghc requires alpine linux 3.5 or higher to run. But otherwise the upstreamed package is the same as my old port. With a caveat that the profiled ghc libraries are now in a sub package named ghc-dev.

Either add the edge testing repository to /etc/apk/repositories, or alternatively you may install ghc/cabal via:

# apk --no-cache add --repository http://dl-cdn.alpinelinux.org/alpine/edge/testing ghc ghc-dev cabal

For an up-to-date list of what is ported and where, see this search:
https://pkgs.alpinelinux.org/packages?name=&branch=&repo=&arch=&maintainer=Mitch+Tishmack

Note: to build static binaries with ghc and musl libc on alpine linux, you only need to add -static to the ld-options of each executable in the .cabal file. The C runtime changes in this repo are unnecessary and will only increase the final binary size.

Example sed to update the .cabal file:

sed -i '/Executable .*/a \ \ ld-options: -static' package.cabal

Also note that I ported upx to alpine edge, so you can add upx from the edge testing repo if you want to test and validate that as well. However, upx does NOT compress dynamic musl binaries; this is a upx limitation, not a limitation of the port.

Unique Reduce

This feature allows merging multiple files into one. The merged file will contain no more than N copies of any given object.

Check whether file is sorted in countsfile()

Every count file that NGLess generates is sorted by row name, and collect relies on this. However, countsfile can be used to load a non-sorted file, which will make collect produce incorrect output.

countsfile should sort its input or (at the very least) raise an error if the input is not sorted.
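
The error-raising variant could look like the following Python sketch. It assumes a tab-separated file with one header line and compares row names with plain string ordering, which may not be the exact ordering NGLess uses. The function name check_sorted_rows is made up.

```python
import io

def check_sorted_rows(handle):
    """Raise ValueError if row names (first TSV column) are not in sorted order."""
    handle.readline()  # skip the header line with the sample names
    previous = None
    for lineno, line in enumerate(handle, start=2):
        name = line.split("\t", 1)[0]
        if previous is not None and name < previous:
            raise ValueError(f"line {lineno}: row {name!r} comes after {previous!r}")
        previous = name

check_sorted_rows(io.StringIO("\tsample\ngeneA\t1\ngeneB\t2\n"))  # passes silently
```

The check is a single streaming pass, so it adds little cost compared with sorting the input.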

Double gzipped bam with ".sam.gz" suffix

From tests/map3 and modifying to output .sam.gz

ngless '0.0'
input = paired('sample.1.fq', 'sample.2.fq', singles='sample.singles.fq')
mapped = map(input, fafile='ref.fna')
write(mapped, ofile='output.sam.gz')

creates output.sam.gz, a double gzipped bam file:

% mv output.sam.gz output.bam.gz.gz
% file output.bam.gz.gz 
output.sam.gz: gzip compressed data, max speed, from Unix
% gunzip output.bam.gz.gz 
% file output.bam.gz 
output.bam.gz: gzip compressed data, extra field
% gunzip output.bam.gz 
% file output.bam 
output.bam: SAMtools BAM (Binary Sequence Alignment/Map), with 4 reference sequences

counts with many features/options should result in many files

The desired behaviour/API is not yet clear, but here is a proposal:

counts = count(mapped, features=['A', 'B'], multiple=[{dist1}, {all1}])
write(counts, ofile='counts.{features}.{multiple}.txt')

would result in 4 files: counts.A.dist1.txt, counts.A.all1.txt, counts.B.dist1.txt, counts.B.all1.txt.
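
The expansion in this proposal is a Cartesian product over the option values. A minimal Python sketch, with the hypothetical helper name expand_ofile:

```python
from itertools import product

def expand_ofile(template, **options):
    """Fill every {key} placeholder with each combination of the option values."""
    keys = list(options)
    return [template.format(**dict(zip(keys, combo)))
            for combo in product(*(options[k] for k in keys))]

names = expand_ofile("counts.{features}.{multiple}.txt",
                     features=["A", "B"], multiple=["dist1", "all1"])
print(names)
# ['counts.A.dist1.txt', 'counts.A.all1.txt', 'counts.B.dist1.txt', 'counts.B.all1.txt']
```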

Support compressing output files with collect()

Using:

collect(counted,
        current=sample,
        allneeded=readlines(all_samples),
        ofile=outputdir + '/' + sample + '.tsv.gz')

produced a sample.tsv.gz file that was not gzipped.
However, collect() uses .gz internally for its partial files.

ofile= doesn't use output_directory

Currently output generated by collect() is saved to the current directory instead of the directory specified via -o output_directory.

ngless -o output script.ngl

where script contains:

collect(count(mapped, features=['seqname']),
    current=sample,
    allneeded=readlines('input.txt'),
    ofile='output.tsv')

produces 'output.tsv' instead of 'output/output.tsv'.

Create Temp dirs

Generate a new directory instead of appending "_temp" to the end of a FileName.

'when-true' unused in flags in external modules

One argument in an external module:

    -
        name: relative_abundance
        atype: flag
        when-true: '--make_relative_abundance'

which I then use with func(relative_abundance=true).

This however causes:

Exiting after fatal error while loading and running script
System Error
Error running command for function "func"
        exit code = 1
        stdout=''
        stderr=' /path/to/Modules/example.ngm/1.0/./run.sh: unrecognized option '--relative_abundance'

'

It seems the name is passed as-is.
The workaround is to give the argument the same name as the command-line option, but this causes problems with options containing dashes: --max-values -> func(max-values=true).
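
The behaviour I expected can be written down as a short Python sketch. It shows the intended mapping only and is not taken from the NGLess source; the function name flag_to_argv and the fallback to --<name> are assumptions.

```python
def flag_to_argv(arg_spec, value):
    """Translate one boolean module argument into command-line tokens."""
    if arg_spec.get("atype") != "flag":
        raise ValueError("not a flag argument")
    if not value:
        return []
    # Prefer the explicit when-true string; fall back to --<name> (assumed default).
    return [arg_spec.get("when-true", "--" + arg_spec["name"])]

spec = {"name": "relative_abundance", "atype": "flag",
        "when-true": "--make_relative_abundance"}
print(flag_to_argv(spec, True))  # ['--make_relative_abundance']
```

With this mapping the script-side name can stay a valid identifier while the external tool still receives an option containing dashes.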

Command line args available in scope (in particular temporary-directory)

I've recently missed having access to $TMPDIR or --temporary-directory as part of the script.
I couldn't find any way to access the environment or the command line arguments besides ARGV.

I'm currently using ARGV to work around this limitation by calling ngless with: ngless script.ngl $TMPDIR.

As related-questions (maybe FAQ candidates):

  • Is there any builtin function that lists all the variables in the existing scope (akin to Python's locals() / globals())?
  • The documentation also doesn't explicitly list variables implicitly defined. Looking at the code I find STDIN, STDOUT and ARGV. Are there others?

ngless type-checking runs before version checking

ngless "0.6"

input = fastq('sample.fq.gz')
write(orf_find(assemble(input),
                is_metagenome=True),
    ofile='output.orfs.fna')

run with version 0.5.1 fails with:

Exiting after fatal error while loading and running script
Script Error
Error in type-checking (line 4): Unknown function 'orf_find'
Cannot continue typechecking.

instead of:

Exiting after fatal error while loading and running script
Script Error
Version 0.6 is not supported (only versions 0.0/0.5 are available in this release).

which is seen when the script contains:

ngless "0.6"

input = fastq('sample.fq.gz')
write(assemble(input),
    ofile='output.orfs.fna')

Filetype in external modules does not enforce number of inputs passed

When using external modules one can specify

        arg1:
            atype: 'readset'
            can_gzip: true

which passes one (.1.fq), two (.1.fq, .2.fq), or three (.1.fq, .2.fq, .single.fq) files depending on what the caller used or produced, fastq(), paired() or paired(singles=), respectively.

Current API specifies that you can add a filetype annotation/constraint to have ngless pass the expected format:

        arg1:
            atype: 'readset'
            filetype: 'fq3'
            can_gzip: true

This didn't work as expected. In the example above, even though fq3 (.1.fq, .2.fq, .single.fq) was specified ngless still passed either one, two or three .fq files.

The same happens if instead of fq3, fq1 is defined. Up to three files are still passed.

A testcase can be found here

ngless copies files untouched if they are written after reading

sample/A.pair.1.fq.gz
sample/A.pair.2.fq.gz
sample/B.pair.1.fq.gz
sample/B.pair.2.fq.gz
sample/C.single.fq.gz
ngless "0.0"
import "mocat" version "0.0"

input = load_mocat_sample('sample')
write(input, ofile='tmpdir/output.fq')

Produces output.pair.1.fq, output.pair.2.fq, output.single.fq all in the original compression instead of following the extension provided.
If the same files are re-read afterwards, ngless will fail to parse them, since it relies on the file extension (not the MIME type) to recognize the format.
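
Independently of fixing write(), the reader could recognize compression from the leading bytes of the file instead of its name. A minimal Python sketch, with the hypothetical function name sniff_compression, covering only gzip and bzip2:

```python
import gzip

def sniff_compression(first_bytes):
    """Guess the compression format from the first bytes of a file."""
    if first_bytes.startswith(b"\x1f\x8b"):  # gzip magic number
        return "gzip"
    if first_bytes.startswith(b"BZh"):       # bzip2 magic number
        return "bzip2"
    return "none"

print(sniff_compression(gzip.compress(b"@read1\nACGT\n+\nIIII\n")[:4]))  # gzip
print(sniff_compression(b"@read1\nACGT\n"))                              # none
```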

Unique Map

This feature will allow dividing the dataset.

Non-atomic write of compressed files

When using:

write(ofile="out.sam")

a .copyFileXXXX is created first and only when complete it is renamed to out.sam.

with:

write(ofile="out.sam.gz")

the final file is created immediately and can be seen changing size as it is written.

For consistency, all output formats should use the .copyFileXXXX approach.
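
The write-then-rename pattern works the same way for compressed output. Below is a minimal Python sketch of the idea, not the NGLess implementation: the function name write_atomically is made up, and the temporary file is created in the destination directory so that the final rename stays on one filesystem.

```python
import gzip
import os
import tempfile

def write_atomically(path, data):
    """Write bytes to a temporary sibling file, then rename it over `path`."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(prefix=".copyFile", dir=directory)
    os.close(fd)
    try:
        # Compress while writing the temporary file when the name ends in .gz.
        opener = gzip.open if path.endswith(".gz") else open
        with opener(tmp, "wb") as out:
            out.write(data)
        os.replace(tmp, path)  # atomic on POSIX within one filesystem
    except BaseException:
        if os.path.exists(tmp):
            os.remove(tmp)
        raise

with tempfile.TemporaryDirectory() as outdir:
    write_atomically(os.path.join(outdir, "out.sam.gz"), b"@HD\tVN:1.0\n")
    print(os.listdir(outdir))  # ['out.sam.gz']
```

Readers never observe a partially written out.sam.gz: the name only appears once the compressed stream has been closed.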

Update samtools

This should just be a matter of updating the URLs in the Makefile, but sometimes we need to change the configuration options.
