NGLess: NGS with less work

License: Other

Haskell 87.39% C 0.21% Makefile 0.54% Shell 1.41% CSS 0.63% HTML 1.53% Python 0.63% Nix 7.41% C++ 0.26%

haskell bioinformatics bioinformatics-pipeline samtools bwa next-generation-sequencing fastq-format fastq science ngs metagenomics genomics haskell-language

NGLess: NGS Processing with Less Work

NGLess is a domain-specific language for NGS (next-generation sequencing) data processing.

For questions and discussions, please use the NGLess mailing list.

If you are using NGLess, please cite:

NG-meta-profiler: fast processing of metagenomes using NGLess, a domain-specific language by Luis Pedro Coelho, Renato Alves, Paulo Monteiro, Jaime Huerta-Cepas, Ana Teresa Freitas, Peer Bork, Microbiome (2019) https://doi.org/10.1186/s40168-019-0684-8

Example

ngless "1.5"
input = fastq(['ctrl1.fq','ctrl2.fq','stim1.fq','stim2.fq'])
input = preprocess(input) using |read|:
    read = read[5:]
    read = substrim(read, min_quality=26)
    if len(read) < 31:
        discard

mapped = map(input,
                reference='hg19')
write(count(mapped, features=['gene']),
        ofile='gene_counts.csv',
        format={csv})

For more information, check the docs. We also have a YouTube tutorial on how to use NGLess and SemiBin together (but you can learn to use NGLess independently of SemiBin).

Installing

See the install documentation for more information.

Bioconda

The recommended way to install NGLess is through bioconda:

conda install -c bioconda ngless

Docker

Alternatively, a Docker container with NGLess is available on Docker Hub:

docker run -v $PWD:/workdir -w /workdir -it nglesstoolkit/ngless:1.5.0 ngless --version

Adapt the mount flags (-v) as needed.

Linux

You can download a statically linked version of NGLess 1.5.0.

This should work across a wide range of Linux versions (please report any issues you encounter):

curl -L -O https://github.com/ngless-toolkit/ngless/releases/download/v1.5.0/NGLess-v1.5.0-Linux-static-full
chmod +x NGLess-v1.5.0-Linux-static-full
./NGLess-v1.5.0-Linux-static-full

This downloaded file bundles bwa, samtools and megahit (also statically linked).

From Source

Installing/compiling from source is also possible. Clone https://github.com/ngless-toolkit/ngless

Dependencies

The simplest way to get an environment with all the dependencies is to use conda:

conda create -n ngless
conda activate ngless
conda config --add channels conda-forge
conda install stack cairo bzip2 gmp zlib perl wget xz pkg-config make

You should have gcc installed (or another C-compiler).

The following sequence of commands should download and build the software:

git clone https://github.com/ngless-toolkit/ngless
cd ngless
stack setup
make

To install, you can use the following command (replace <PREFIX> with the directory where you wish to install, default is /usr/local):

make install prefix=<PREFIX>

Running Sample Test Scripts on Local Machine

Developers who have successfully compiled and installed NGLess can next run the test scripts in the tests folder to see the output of sample test cases.

cd tests

Once in the tests directory, select any of the test folders and run NGLess there.

For example, to run the regression-fqgz test:

cd regression-fqgz
ngless ungzip.ngl

After running this script, open the newly generated folder ungzip.ngl.output_ngless and view the template in the index.html file.

Developers who have got this far can find more datasets for testing purposes in these documentation links:

Human Gut Metagenomics Functional & Taxonomic Profiling
Ocean Metagenomics Functional Profiling
Ocean Metagenomics Assembly and Gene Prediction

More information

Authors

Luis Pedro Coelho (email: [email protected]) (on twitter: @luispedrocoelho)
Paulo Monteiro
Renato Alves
Ana Teresa Freitas
Peer Bork

ngless's Issues

Document paired()

There are a few mentions in the docs, but it's currently absent from the function list.
The argument singles= is also not listed.

The special value <references> can be used to refer to a common location for resources.
It's convenient to use with map(fafile="<references>/myfasta.fna", ...).
Its value comes from calling ngless with --search-dir.

Including "-1" in count() should be default

Practical experience seems to indicate that users want it.

Compress temporary files

This should use something like lz4 or zstandard so that there won't be a disk/speed tradeoff.

Multiple `-1` with count() when multiple features/subfeatures are requested

This is related to #34.

The special case of -1 is currently written several times to the output.

With multiple subfeatures, there's an additional bug in that the -1 line doesn't include the 2 identifying prefixes:

	reads.fq.gz
-1	4
gene	gene_id	feature_A	1
gene	gene_id	feature_B	2
gene	gene_id	feature_C	2
gene	gene_id	feature_D	1
-1	4
gene	gene_name	featA	1
gene	gene_name	featB	2
gene	gene_name	featC	2
gene	gene_name	featD	1

GFF subfeature counting should be expanded

The GFF3 format specification allows for multiple value attributes if separated with a comma.
From the official docs:

Parent=AF2312,AB2812,abc-3

With the current version of NGLess a GFF file:

##gff-version 3
reference	protein_coding	gene	40	100	.	+	.	gene_id=geneA;gene_name=featA1,featA2
reference	protein_coding	gene	110	130	.	+	.	gene_id=geneB;gene_name=featA1
reference	protein_coding	gene	140	200	.	+	.	gene_id=geneC;gene_name=featA2

and a script:

    ngless "0.7"

    input = fastq('reads.fq.gz')
    mapped = map(input, fafile='ref.fna.gz')

    union = count(mapped,
                  gff_file='features.gff',
                  features=['gene'],
                  subfeatures=['gene_name'],
                  mode={union})
    write(union, ofile='output.txt')

produces:

	reads.fq.gz
-1	4
featA1	0
featA1,featA2	4
featA2	1

however the expectation is that values are expanded.
Additionally, expansion should take into account the content of arguments mode= and multiple= in count():

	reads.fq.gz
-1	4
featA1	4
featA2	5
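
The requested expansion can be sketched as a post-processing step on the reported counts. This is only an illustration in Python, not NGLess code: `expand_counts` is a hypothetical helper, it credits every name in a comma-separated value with the full count, and it ignores `mode=` and `multiple=`, which the issue says a real implementation should take into account.

```python
from collections import Counter


def expand_counts(raw_counts):
    """Split comma-separated subfeature names and credit each name with the full count."""
    expanded = Counter()
    for name, n in raw_counts.items():
        for part in name.split(","):
            expanded[part] += n
    return dict(sorted(expanded.items()))


# Counts as reported above (the -1 row is left out).
raw = {"featA1": 0, "featA1,featA2": 4, "featA2": 1}
print(expand_counts(raw))  # {'featA1': 4, 'featA2': 5}
```

With the counts from this report, the result matches the expected output listed above.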

Document `assemble()`

The docs referencing orf_find() mention assemble() which isn't mentioned anywhere else.

map-split: use relative instead of absolute path for symlinks

As part of the split-map strategy implemented in ngless 0.6 symlinks are created to avoid copying split files. As of this version symlinks use absolute paths.
While this works on a single machine it may break on shared filesystems (different mountpoints) or if the index folder needs to be manually relocated.

Making symlinks relative to the root of the index folder addresses the issues mentioned above.
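
A minimal sketch of the idea in Python (NGLess itself is written in Haskell, so this is an illustration only). `relative_symlink` and the file names are hypothetical. The link target is computed relative to the directory that holds the link, which is how the operating system resolves relative symlinks, so the whole folder can be moved without breaking the link.

```python
import os
import tempfile


def relative_symlink(target, link_path):
    """Create link_path pointing at target through a path relative to the link's directory."""
    link_dir = os.path.dirname(os.path.abspath(link_path))
    rel_target = os.path.relpath(os.path.abspath(target), start=link_dir)
    os.symlink(rel_target, link_path)
    return rel_target


# Demo: build a small index folder, link into it, then relocate the whole tree.
root = tempfile.mkdtemp()
index_dir = os.path.join(root, "index")
os.makedirs(os.path.join(index_dir, "split"))
source = os.path.join(index_dir, "block0.fna")
with open(source, "w") as handle:
    handle.write(">seq\nACGT\n")
stored = relative_symlink(source, os.path.join(index_dir, "split", "block0.fna"))

moved = root + "-moved"
os.rename(root, moved)
# The link still resolves after the move because it never mentions the old root.
with open(os.path.join(moved, "index", "split", "block0.fna")) as handle:
    content_after_move = handle.read()
```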

Add functionality to pre-install modules

As it stands, references can be pre-installed, but not modules.

installDS scripts do not detect errors

Killing bwa halfway through the process results in a half-baked output as the overall process just happily continues running.

Extend select() to operate per-read

This would break the link between mates in the same paired-end read, but it is useful in some cases.

Clarify documentation of `mode` in `count()` to be more user friendly

count( mode = ... )'s documentation is unclear and perhaps a little low-level. It explains how the options relate to different sets but not how these sets relate to the read or the features being counted.

For example, if a read maps to a region with two non-overlapping features, and the option mode={intersection-nonempty} is used, does it mean that the read is not considered?

Perhaps some illustration will be clearer. ASCII-art to the rescue:

Reference  ###################################

Feature_A       ========
Feature_B            ==========
Feature_C                  ============

Read1              -----
Read2               -----
Read3                  -----
Read4                         -----
Read5                               -----
Read6                                   -----

Read1 is contained in Feature_A and partially overlaps Feature_B.
Read2 is not contained in any feature but partially overlaps Feature_A and Feature_B.
Read3 is contained in Feature_B and partially overlaps Feature_A and Feature_C.
Read4 is contained in Feature_C
Read5 partially overlaps Feature_C
Read6 doesn't overlap any feature.

And the following perhaps in the form of a table

With:

mode={union}
- Read1 is counted for Feature_A...
- Read2 is counted for Feature_A...
- Read3 is counted for Feature_B...
- Read4 is counted for Feature_C...
- Read5 is counted ...
- Read6 is never counted
mode={intersection-strict}
- Read1 is ...
- Read2 is ...
- Read3 is ...
- Read4 is ...
- Read5 is ...
- Read6 is never counted
mode={intersection-nonempty}
- Read1 is ...
- Read2 is ...
- Read3 is ...
- Read4 is ...
- Read5 is ...
- Read6 is never counted
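
To make the three modes concrete, here is a small Python sketch. It assumes the set-based definitions used by htseq-count, which the mode names resemble: for every position of a read, collect the set of features covering it, then combine those sets. Whether NGLess behaves identically is exactly what this issue asks the docs to spell out, so treat the sketch as an assumption. The coordinates are hypothetical and are not read off the diagram above.

```python
# Hypothetical half-open coordinates [start, end); not taken from the diagram.
FEATURES = {"Feature_A": (5, 13), "Feature_B": (10, 20), "Feature_C": (16, 28)}
READS = {
    "Read1": (8, 13),   # inside A, partially overlaps B
    "Read2": (9, 14),   # partially overlaps A and B
    "Read3": (12, 17),  # inside B, partially overlaps A and C
    "Read4": (21, 26),  # inside C only
    "Read5": (25, 30),  # partially overlaps C
    "Read6": (32, 37),  # overlaps nothing
}
MODES = ("union", "intersection-strict", "intersection-nonempty")


def features_at(pos):
    """Set of features covering a single reference position."""
    return {name for name, (start, end) in FEATURES.items() if start <= pos < end}


def resolve(read, mode):
    """Combine the per-position feature sets of a read according to mode."""
    start, end = read
    sets = [features_at(pos) for pos in range(start, end)]
    if mode == "union":
        return set().union(*sets)
    if mode == "intersection-strict":
        return set.intersection(*sets)
    if mode == "intersection-nonempty":
        nonempty = [s for s in sets if s]
        return set.intersection(*nonempty) if nonempty else set()
    raise ValueError(f"unknown mode: {mode}")


for name, read in READS.items():
    print(name, {mode: sorted(resolve(read, mode)) for mode in MODES})
```

Under these assumptions an empty result means the read matches no feature, and a result with more than one feature is ambiguous; how ambiguous reads are then counted would depend on `multiple=`.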

Intermediate results on /tmp

Re-create indices if reference is modified

When mapping with NGLess indices will be created if they don't exist.
Once they do exist, they are currently not recreated if a reference is modified.
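
One possible staleness check, sketched in Python under the assumption that comparing modification times is good enough (a checksum would be more robust). `index_is_stale` and the `.idx` file name are hypothetical and do not reflect how NGLess names or tracks its indices.

```python
import os
import tempfile


def index_is_stale(reference, index_files):
    """Return True if any index file is missing or older than the reference."""
    ref_mtime = os.path.getmtime(reference)
    for index_file in index_files:
        if not os.path.exists(index_file):
            return True
        if os.path.getmtime(index_file) < ref_mtime:
            return True
    return False


# Demo with explicit timestamps so the outcome does not depend on the clock.
workdir = tempfile.mkdtemp()
reference = os.path.join(workdir, "ref.fna")
index_file = reference + ".idx"
for path in (reference, index_file):
    with open(path, "w") as handle:
        handle.write("")
os.utime(reference, (100, 100))
os.utime(index_file, (200, 200))
stale_before_edit = index_is_stale(reference, [index_file])

os.utime(reference, (300, 300))  # simulate a later modification of the reference
stale_after_edit = index_is_stale(reference, [index_file])
```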

Add interleaved FastQ support

The format is not formally described but is used in the wild. On a quick search there was no mention of how 'singles' are handled. Possibilities include:

Output .1 followed by .2 and add singles at the end of the file
Tolerate unpaired singles in the middle of the file.

The second variant is more versatile (e.g. for filter()) as it doesn't require a second file to hold reads as they are being processed.
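
A sketch of the second variant in Python, for illustration only. It assumes four-line FastQ records and that mates share a name up to a trailing `/1` or `/2`; both are assumptions, since the format is not formally specified. A record whose neighbour has a different base name is treated as a single.

```python
def read_fastq(lines):
    """Yield (name, sequence, quality) from four-line FastQ records.

    An incomplete trailing record is silently ignored.
    """
    for header, seq, _plus, qual in zip(*[iter(lines)] * 4):
        yield header.rstrip("\n")[1:], seq.rstrip("\n"), qual.rstrip("\n")


def base_name(name):
    """Strip the description and a trailing /1 or /2 mate suffix."""
    name = name.split()[0]
    return name[:-2] if name.endswith(("/1", "/2")) else name


def group_interleaved(records):
    """Split a stream into mate pairs and singles, tolerating singles mid-file."""
    pairs, singles = [], []
    pending = None
    for record in records:
        if pending is not None and base_name(pending[0]) == base_name(record[0]):
            pairs.append((pending, record))
            pending = None
        else:
            if pending is not None:
                singles.append(pending)
            pending = record
    if pending is not None:
        singles.append(pending)
    return pairs, singles


demo = [
    "@r1/1", "ACGT", "+", "IIII",
    "@r1/2", "TTGA", "+", "IIII",
    "@r2/1", "GGCC", "+", "IIII",  # single in the middle of the file
    "@r3/1", "AAAA", "+", "IIII",
    "@r3/2", "CCCC", "+", "IIII",
]
pairs, singles = group_interleaved(read_fastq(demo))
```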

External module and parser limitations

arg1 fails if of `atype: 'str'`

Using Modules/test.ngm/0.1/module.yaml:

version: '0.1'
name: 'test'
functions:
    -
        nglName: "test"
        arg0: './dummy.sh'
        arg1:
            atype: 'str'
        return:
            rtype: 'void'
            name: 'output'
            extension: 'void'

and test.ngl:

ngless "0.6"
local import "test" version "0.1"

sample = "test"
test(sample)

results in:

Exiting after fatal error while loading and running script
Should Not Occur Error! This probably indicates a bug in ngless.
        Please get in touch with the authors with a description of how this happened.
        If possible run your script with the --trace flag and post the script and the resulting trace at 
                http://github.com/luispedro/ngless/issues
        or email us at [email protected].
AsFile path got NGOString "test"

`return: rtype: void` requires `name` and `extension`.

Despite using rtype = void, NGLess complained when I didn't include a name and an extension argument as part of return:

Could not load module file ./Modules/test.ngm/0.1/module.yaml. Error was `Error in $.functions[0].return: key "name" not present`
Could not load module file ./Modules/test.ngm/0.1/module.yaml. Error was `Error in $.functions[0].return: key "extension" not present`

Cannot define function with named arguments only

Trying to omit arg1 results in a parsing error.

This is both visible in the case of:

local import "test" version "0.1"
test()

as well as in:

local import "test" version "0.1"
sample = "test"
test(name=sample)

resulting in:

unexpected TOperator ')'
expecting len (reserved word), operator -, not (reserved word), operator (, function call, operator [ or variable

unexpected TOperator '='
expecting binary operator, keyword argument list or operator )

Fix error message when load_mocat_sample is used on non-existing directory

Should say "directory not found".

Explicit way of invalidating locks

Sometimes when I'm testing things, jobs fail and leave active locks behind.
Subsequent calls fail as they cannot obtain any lock.

Currently I work around this by removing the ngless-locks folder or the matching subfolder inside ngless-locks, which is suboptimal.

With that said, it would be nice if after:

ngless --options ... mycustomscript.ngl

one could:

ngless --clear-locks mycustomscript.ngl

Decrease memory usage of count() with seqname

Currently, the code uses a sorted Vector InfoRef where InfoRef is

data InfoRef = InfoRef {-# UNPACK #-} !ShortByteString {-# UNPACK #-} !Double

but this has a lot of memory overhead compared to something like a C++ std::vector<std::pair<const char*, double>> where the string data is packed together.

This could easily be done as a generic Haskell library independent of any NGLess code.

Compute mapstats after select()

Currently, select() computes mapstats only if there is no block.

Add search path for references

For example:

map(input, fafile="<>/catalog/file.fna")

and ngless --path "/usr/share/...:/opt/share/...:..." would look into all the given directories.
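
The lookup proposed here could work roughly as follows. This Python sketch is an illustration of the proposal, not existing NGLess behaviour: `resolve_reference` is a hypothetical helper, and the `<>` prefix and colon-separated path are taken from the example above.

```python
import os
import tempfile


def resolve_reference(spec, search_path):
    """Resolve a '<>/...' reference against a colon-separated list of directories."""
    prefix = "<>/"
    if not spec.startswith(prefix):
        return spec  # ordinary path, nothing to search
    relative = spec[len(prefix):]
    for directory in search_path.split(":"):
        candidate = os.path.join(directory, relative)
        if os.path.exists(candidate):
            return candidate  # first match wins
    raise FileNotFoundError(f"{relative} not found in any of: {search_path}")


# Demo: the file only exists under the second directory of the search path.
first_dir = tempfile.mkdtemp()
second_dir = tempfile.mkdtemp()
os.makedirs(os.path.join(second_dir, "catalog"))
with open(os.path.join(second_dir, "catalog", "file.fna"), "w") as handle:
    handle.write(">seq\nACGT\n")
found = resolve_reference("<>/catalog/file.fna", first_dir + ":" + second_dir)
```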

Support outputting count files in binary format

Instead of CSV/TSV format.

Argument not checked prior to execution

According to the FAQ ngless is supposed to check all input files prior to execution.

In the case below it doesn't check if the fafile or the folder containing it exists before executing.
Instead it executes up to that point and then fails when it can't create indexes.

ngless "0.0"
import "parallel" version "0.0"
import "samtools" version "0.0"
import "mocat" version "0.0"

TMPDIR = ARGV[2]
DB = TMPDIR + '/db.fna'
DATADIR = 'data/'

sample = ARGV[1]
input = load_mocat_sample(DATADIR + sample)

preprocess(input, keep_singles=True) using |read|:
    read = substrim(read, min_quality=25)
    if len(read) < 45:
        discard

hits = map(input, fafile=DB)

write(hits, ofile='outputs/' + sample + '_db.bam')

Called with ngless --trace map.ngl sample /tmp/non_existing_dir

CWL tool descriptions are missing outputs, therefore not usable in CWL workflows

Hey @luispedro , thanks for using the argparse2tool to generate CWL descriptions in https://github.com/luispedro/ngless/tree/master/scripts

However there are no output stanzas as argparse models the inputs to a program, not the outputs.

While the CWL ngless descriptions can be used to run a ngless command standalone, they can not be used in CWL workflows as stated in https://ngless.readthedocs.io/en/latest/faq.html#what-is-the-relationship-of-ngless-to-the-common-workflow-language due to the lack of outputs

At https://github.com/erasche/argparse2tool#cwl-specific-functionality we document how to pass in hand-written outputs stanzas using --output_section when invoking argparse2tool.

Alternatively I see some JSON-esque code at https://github.com/luispedro/ngless/blob/master/scripts/ngless-count.py#L35

You can pass the CWL input object directly as JSON to any tool and skip all the argument parsing complexity. Likewise any CWL compliant platform is able to consume JSON from a tool to learn at run time the actual outputs and their locations + any optional metadata.

I think DSLs are pretty cool and useful and I'd love to see more that compile or convert to CWL giving everyone the best of all worlds!

> make test

make test is not generating an executable at dist/, and the instruction to copy the executable to the directory root gives an error.

Bundled megahit is missing a version tag

Once extracted with ngless --print-path megahit, none of the bundled files includes a reference to the NGLess version in its name.

Since NGLess currently only checks if the files exist, future releases should version tag all files.
The use-case of having different NGLess releases running in the same environment should be considered.

Update to use ghc from the edge testing repository for alpine linux

Opening this issue as an alert to anyone on github that appears to be using my ghc port.

Ghc and cabal are now upstreamed for x86_64 on alpine linux edge. Ghc requires alpine linux 3.5 or higher to run. But otherwise the upstreamed package is the same as my old port. With a caveat that the profiled ghc libraries are now in a sub package named ghc-dev.

Either add the edge testing repository to /etc/apk/repositories, or alternatively you may install ghc/cabal via:

# apk --no-cache add --repository http://dl-cdn.alpinelinux.org/alpine/edge/testing ghc ghc-dev cabal

For an up-to-date list of what is ported and where, see this search:
https://pkgs.alpinelinux.org/packages?name=&branch=&repo=&arch=&maintainer=Mitch+Tishmack

Note: to build static binaries with ghc and musl libc on Alpine Linux, you only need to add -static to the ld-options of any executable in the .cabal file. The C runtime changes in this repo are unnecessary and will only increase the final binary size.

Example sed to update the .cabal file:

sed -i '/Executable .*/a \ \ ld-options: -static' package.cabal

Also note that I ported upx to Alpine edge, so you can add upx from the edge testing repo if you want to test and validate that as well. Note, however, that upx does NOT compress dynamic musl binaries; this is a upx limitation, not a limitation of the port.

Unique Reduce

This feature allows merging multiple files into one. The new file will not contain more than N copies of a given object.

Check whether file is sorted in countsfile()

Every count file that NGLess generates is sorted by rowname, and collect relies on this. However, countsfile can be used to load a non-sorted file, which will lead to a very bad result.

countsfile should sort its input or (at the very least) raise an error if the input is not sorted.
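
The "at the very least raise an error" option could look like the Python sketch below. It assumes a tab-separated file whose first line is a header and that row names are compared as plain strings; the ordering NGLess actually uses is not stated here, so that part is an assumption. `check_sorted_rownames` is a hypothetical name.

```python
def check_sorted_rownames(lines):
    """Raise ValueError if row names (first column) are not in non-decreasing order."""
    previous = None
    for lineno, line in enumerate(lines, start=1):
        if lineno == 1:
            continue  # header line with sample names
        name = line.rstrip("\n").split("\t", 1)[0]
        if previous is not None and name < previous:
            raise ValueError(
                f"line {lineno}: row {name!r} comes after {previous!r}; input is not sorted"
            )
        previous = name
    return True


sorted_rows = ["\tsample1", "-1\t4", "featA\t1", "featB\t2"]
unsorted_rows = ["\tsample1", "featB\t2", "featA\t1"]
print(check_sorted_rownames(sorted_rows))
```

Failing early this way is cheaper than sorting, while sorting the input would make `countsfile` usable on arbitrary files.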

Prodigal produces different output depending on GCC version used to compile it

This is not an NGLess bug but is meant to track progress upstream.

hyattpd/Prodigal#34

This issue was first noticed when tests on our alpine build-image failed due to a different output but passed on the ubuntu-image.

Ability to customize location of indices for references

Ngless assumes that indices exist in the same location as reference files.
In some cases this location may be read-only.

Allowing the user to specify a different location for indices would allow working around this.

Double gzipped bam with ".sam.gz" suffix

From tests/map3 and modifying to output .sam.gz

ngless '0.0'
input = paired('sample.1.fq', 'sample.2.fq', singles='sample.singles.fq')
mapped = map(input, fafile='ref.fna')
write(mapped, ofile='output.sam.gz')

creates output.sam.gz, a double gzipped bam file:

% mv output.sam.gz output.bam.gz.gz
% file output.bam.gz.gz 
output.sam.gz: gzip compressed data, max speed, from Unix
% gunzip output.bam.gz.gz 
% file output.bam.gz 
output.bam.gz: gzip compressed data, extra field
% gunzip output.bam.gz 
% file output.bam 
output.bam: SAMtools BAM (Binary Sequence Alignment/Map), with 4 reference sequences

Early check for column headers in `count(.., functional_map="file.map")`

If the user does

c = count(mapped, features=['DOES_NOT_EXIST'], functional_map='mymap.map')

and DOES_NOT_EXIST is missing, then the user will get an error only when count is run.

NGLess should be able to warn immediately.

uncompress function 'write' results

output stats (mapping/fastqs) when using lock1()

Currently, these get computed and lost.

counts with many features/options should result in many files

Not clear exactly on the desired behaviour/API. But here is a proposal:

counts = count(mapped, features=['A', 'B'], multiple=[{dist1}, {all1}])
write(counts, ofile='counts.{features}.{multiple}.txt')

would result in 4 files: counts.A.dist1.txt, counts.A.all1.txt, counts.B.dist1.txt, counts.B.all1.txt.
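
The file-name expansion in this proposal amounts to a Cartesian product over the option lists. A Python sketch of just that naming step (`expand_ofile` is a hypothetical helper, not an NGLess function):

```python
from itertools import product


def expand_ofile(template, **options):
    """Fill the template once for every combination of option values."""
    keys = list(options)
    combos = product(*(options[key] for key in keys))
    return [template.format(**dict(zip(keys, combo))) for combo in combos]


names = expand_ofile(
    "counts.{features}.{multiple}.txt",
    features=["A", "B"],
    multiple=["dist1", "all1"],
)
print(names)
```

For the example above this yields the four names listed, in the same order.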

Support compressing output files with collect()

Using:

collect(counted,
        current=sample,
        allneeded=readlines(all_samples),
        ofile=outputdir + '/' + sample + '.tsv.gz')

Produced a sample.tsv.gz file that was not gzipped.
However collect() uses .gz internally for its partial files.

ofile= doesn't use output_directory

Currently output generated by collect() is saved to the current directory instead of the directory specified via -o output_directory.

ngless -o output script.ngl

where script contains:

collect(count(mapped, features=['seqname']),
    current=sample,
    allneeded=readlines('input.txt'),
    ofile='output.tsv')

produces 'output.tsv' instead of 'output/output.tsv'.

--search-dir is not expanded prior to passing to external module

external(file="<data>/input.fa")

called with ngless --search-dir data=/somedir results in calling external() with <data>/input.fa instead of /somedir/input.fa.

Testcase here
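
The expected substitution, sketched in Python for illustration. `expand_search_dirs` is a hypothetical helper; the mapping mirrors `--search-dir data=/somedir` from the call above, and raising on an unknown name is a design choice of this sketch rather than documented behaviour.

```python
import re


def expand_search_dirs(value, search_dirs):
    """Replace a leading <name> with the directory registered under that name."""

    def substitute(match):
        name = match.group(1)
        if name not in search_dirs:
            raise KeyError(f"no search directory registered for <{name}>")
        return search_dirs[name]

    # Only a <name> at the very start of the value is treated as a reference.
    return re.sub(r"^<([^<>/]*)>", substitute, value)


print(expand_search_dirs("<data>/input.fa", {"data": "/somedir"}))  # /somedir/input.fa
```

Applying this to every string argument before the external command is assembled would give external modules the same view of paths as built-in functions.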

Create Temp dirs

Generate new dir instead of concatenating "_temp" to the end of a FileName.

'when-true' unused in flags in external modules

One argument in an external module:

    -
        name: relative_abundance
        atype: flag
        when-true: '--make_relative_abundance'

which I then use with func(relative_abundance=true).

This however causes:

Exiting after fatal error while loading and running script
System Error
Error running command for function "func"
        exit code = 1
        stdout=''
        stderr=' /path/to/Modules/example.ngm/1.0/./run.sh: unrecognized option '--relative_abundance'

'

It seems the name is passed as-is.
The workaround is to give the same name as the argument but this causes problems with some options containing dashes: --max-values -> func(max-values=true).
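
The behaviour I expected can be sketched in Python as follows. This is an illustration of the intended semantics, not the NGLess implementation: `build_flags` is hypothetical, and falling back to `--<name>` when `when-true` is absent is an assumption that mirrors the behaviour observed above.

```python
def build_flags(arg_specs, call_args):
    """Translate boolean flag arguments into command-line options."""
    argv = []
    for spec in arg_specs:
        if spec.get("atype") != "flag":
            continue
        if call_args.get(spec["name"], False):
            # Prefer the explicit when-true string; fall back to --<name>.
            argv.append(spec.get("when-true", "--" + spec["name"]))
    return argv


specs = [
    {
        "name": "relative_abundance",
        "atype": "flag",
        "when-true": "--make_relative_abundance",
    }
]
print(build_flags(specs, {"relative_abundance": True}))  # ['--make_relative_abundance']
```

With `when-true` honoured, the argument name no longer has to match the option name, which also avoids the problem with dashes.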

Command line args available in scope (in particular temporary-directory)

I've recently missed having access to $TMPDIR or --temporary-directory as part of the script.
I couldn't find any way to access the environment or the command line arguments besides ARGV.

I'm currently using ARGV to work around this limitation by calling ngless with: ngless script.ngl $TMPDIR.

As related-questions (maybe FAQ candidates):

Is there any builtin function that lists all the variables in the existing scope (akin to Python's locals() / globals())?
The documentation also doesn't explicitly list variables implicitly defined. Looking at the code I find STDIN, STDOUT and ARGV. Are there others?

Make safeio dependency optional for Windows compilation

Safeio requires the unix package and, thus, cannot be used on Windows.

ngless type-checking runs before version checking

ngless "0.6"

input = fastq('sample.fq.gz')
write(orf_find(assemble(input),
                is_metagenome=True),
    ofile='output.orfs.fna')

run with version 0.5.1 fails with:

Exiting after fatal error while loading and running script
Script Error
Error in type-checking (line 4): Unknown function 'orf_find'
Cannot continue typechecking.

instead of:

Exiting after fatal error while loading and running script
Script Error
Version 0.6 is not supported (only versions 0.0/0.5 are available in this release).

which is seen when the script contains:

ngless "0.6"

input = fastq('sample.fq.gz')
write(assemble(input),
    ofile='output.orfs.fna')

Filetype in external modules does not enforce number of inputs passed

When using external modules one can specify

        arg1:
            atype: 'readset'
            can_gzip: true

which passes one (.1.fq), two (.1.fq, .2.fq), or three (.1.fq, .2.fq, .single.fq) files depending on what the caller used or produced, fastq(), paired() or paired(singles=), respectively.

Current API specifies that you can add a filetype annotation/constraint to have ngless pass the expected format:

        arg1:
            atype: 'readset'
            filetype: 'fq3'
            can_gzip: true

This didn't work as expected. In the example above, even though fq3 (.1.fq, .2.fq, .single.fq) was specified ngless still passed either one, two or three .fq files.

The same happens if instead of fq3, fq1 is defined. Up to three files are still passed.

A testcase can be found here

ngless copies files untouched if they are written after reading

sample/A.pair.1.fq.gz
sample/A.pair.2.fq.gz
sample/B.pair.1.fq.gz
sample/B.pair.2.fq.gz
sample/C.single.fq.gz

ngless "0.0"
import "mocat" version "0.0"

input = load_mocat_sample('sample')
write(input, ofile='tmpdir/output.fq')

Produces output.pair.1.fq, output.pair.2.fq, output.single.fq all in the original compression instead of following the extension provided.
If the same files are re-read afterwards ngless will fail to parse since it relies on the file extension (not the MIMEtype) to recognize the format.
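
For the re-reading side, detecting compression from the file content rather than the extension would avoid the parse failure. A Python sketch, for illustration only: gzip streams start with the two bytes 1f 8b, and the helper names are hypothetical.

```python
import gzip
import os
import tempfile

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def is_gzipped(path):
    """Detect gzip compression from the file content, ignoring the extension."""
    with open(path, "rb") as handle:
        return handle.read(2) == GZIP_MAGIC


def open_text(path):
    """Open a possibly gzip-compressed text file for reading."""
    opener = gzip.open if is_gzipped(path) else open
    return opener(path, "rt")


# Demo: a gzip-compressed file with a misleading .fq extension is still readable.
workdir = tempfile.mkdtemp()
misnamed = os.path.join(workdir, "output.fq")
with gzip.open(misnamed, "wt") as out:
    out.write("@r1\nACGT\n+\nIIII\n")
with open_text(misnamed) as handle:
    first_line = handle.readline().rstrip("\n")
```

This only addresses reading; the write side would still need to recompress or decompress to match the requested extension.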

Unique Map

This feature will allow dividing the dataset.

Non-atomic write of compressed files

When using:

write(ofile="out.sam")

a .copyFileXXXX is created first and only when complete it is renamed to out.sam.

with:

write(ofile="out.sam.gz")

the final file is created immediately and can be seen changing size as it is written.

For consistency all output formats should use the .copyFileXXX approach.
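
The write-then-rename approach applies to compressed output as well. A minimal Python sketch of the pattern (NGLess is written in Haskell, so this only illustrates the idea; `write_gzip_atomically` and the temporary file prefix are hypothetical): the data is compressed into a temporary file in the destination directory and renamed into place only once complete.

```python
import gzip
import os
import tempfile


def write_gzip_atomically(path, text):
    """Write gzip-compressed text to a temporary file, then rename it into place."""
    directory = os.path.dirname(os.path.abspath(path))
    # Same directory as the destination, so the rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-", suffix=".gz")
    os.close(fd)
    try:
        with gzip.open(tmp_path, "wt") as out:
            out.write(text)
        os.replace(tmp_path, path)  # the final name appears only when complete
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


outdir = tempfile.mkdtemp()
final_path = os.path.join(outdir, "out.sam.gz")
write_gzip_atomically(final_path, "@HD\tVN:1.0\n")
with gzip.open(final_path, "rt") as handle:
    written = handle.read()
remaining = os.listdir(outdir)
```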

NGLess should restart when using `parallel` and tasks are not finished

Currently, the expectation is that the user will run ngless enough times, but that's not a great UI.

Update samtools

This should just be a matter of updating the URLs in the Makefile, but sometimes we need to change the configuration options.