meuleman / epilogos
Methods for summarizing and visualizing multi-biosample functional genomic annotations
Home Page: https://epilogos.net
License: GNU General Public License v3.0
Opening and immediately closing a file handle looks a bit odd. Is this perhaps being used to create an empty file?
epilogos/src/computeEpilogosSlurm.py
Lines 365 to 373 in 80bf1f5
It might be harmless and work fine, if so, but for readability, there are other ways to touch a file that might be a bit clearer, e.g.:
from pathlib import Path
Path('path/to/file').touch()
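Path.touch() is also idempotent: with the default exist_ok=True, calling it on an existing file just updates the modification time instead of raising. A quick self-contained check (the file name is illustrative):

```python
import tempfile
from pathlib import Path

# Create an empty marker file in a throwaway directory
marker = Path(tempfile.mkdtemp()) / "marker.txt"
marker.touch()
marker.touch()  # safe to call again; exist_ok=True is the default

assert marker.exists() and marker.stat().st_size == 0
```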
It would be useful to have more comments explaining what the greatest-hits output is, for people who are less familiar with it.
epilogos/scripts/computeEpilogos.sh
Line 3 in 80bf1f5
With production bash scripts shared with others, I suggest using set -u together with -e -o pipefail.
This will cause the script to exit with a non-zero status if a variable is left unset or is otherwise undefined.
This is useful when using positional variables like ${1}, ${2}, etc., which happens later on in this script. If a variable is left out by accident, the script will stop early.
I also saw this here:
It may be useful to add set -ueo pipefail in here, as well:
The following script may not be obsolete:
epilogos/scripts/preprocess_data_ChromHMM.sh
Lines 1 to 33 in 80bf1f5
Some general suggestions that I have are:
Use set -ueo pipefail below the shebang line to ensure variables are set and to handle error states more closely. You could even add -x to log each command. This can be useful for pipelines, log statements, or debugging a problematic script.
Try to avoid using uppercase variable names in bash scripts. Some variable names in bash are reserved POSIX names, and your intended values will get clobbered if you accidentally use them. This can lead to non-obvious and frustrating errors.
Use curly braces around positional variables, i.e. $1 to ${1}, $2 to ${2}, etc. Not a big deal here, but a good habit to get into, as this allows you to specify more than nine input parameters ($10 will not work, but ${10} will, for example).
epilogos/src/computeEpilogosPairwiseVisual.py
Lines 223 to 240 in d3786b2
Several lines mention this term; let's replace it with sampling/sample to avoid confusion.
Hello,
I wonder if you can elaborate on the interpretation of the negative values in epilogos?
Another question: does a high value at a particular position mean that the state at this position is the same across all the used tracks, i.e. not much variability across the different tracks?
Thank you!
Abdul
This may be looked at in the course of preparing epilogos for distribution via PyPI, but without reading the README and just looking at the scripts and src directories, I wasn't quite sure what the starting point or entry point was for running things.
A lot of people don't read documentation and just jump in. If I didn't read the README, I'd have assumed src/epilogos.py might be the starting point.
Perhaps simplifying things to a dedicated, single "Pythonic" entry point might be helpful. This could call either the Slurm or the single-node compute mode of the epilogos kit, instead of pointing users to one script or another.
https://amir.rachum.com/blog/2017/07/28/python-entry-points/
https://chriswarrick.com/blog/2014/09/15/python-apps-the-right-way-entry_points-and-scripts/
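As a sketch of what that could look like (the module path and function name here are assumptions about the package layout, not the project's actual code), a single console-script entry point can dispatch to either compute mode:

```toml
# pyproject.toml (illustrative)
[project.scripts]
epilogos = "epilogos.__main__:main"
```

With this installed via pip, users would run `epilogos ...` directly, and main() could choose between Slurm and single-node execution based on a flag.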
epilogos/src/columnSplitterAdsera.py
Lines 108 to 127 in 80bf1f5
Using click or similar libraries could help simplify handling options (and boolean-flag options).
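For instance, a minimal sketch using the standard-library argparse (click reads similarly, with decorators instead of add_argument calls); the option names below are invented for illustration:

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(
        description="Split state columns (illustrative options).")
    parser.add_argument("--input", required=True, help="input matrix file")
    parser.add_argument("--output-dir", default=".",
                        help="where to write per-column files")
    # store_true gives a boolean flag that defaults to False
    parser.add_argument("--verbose", action="store_true",
                        help="print progress to stderr")
    return parser

args = build_parser().parse_args(["--input", "states.txt", "--verbose"])
```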
epilogos/src/computeEpilogosSlurm.py
Lines 463 to 466 in 80bf1f5
Is there a reason you use sleep here, and different time values? (Is it to delay submission of separate jobs to the cluster?)
Hi,
I was trying the examples.
computeEpilogos_singleChromosomeSingleProcessor.sh data/chr1_127epigenomes_15observedStates.txt.gz 0 15 KL "33-34,37-45,47-48,61"
ls KL
chr1_Pnumerators.txt p1a_chr1.stderr p1a_chr1.stdout p2_chr1.stderr p2_chr1.stdout Q.txt
cat p2_chr1.stderr
Usage type #1: /scratch/genomic_med/apps/epilogos/bin/computeEpilogosPart2_perChrom infile KLtype NsitesGenomewide infileQ outfileObs outfileQcat chr [infileQ2]
where
* infile holds the tab-delimited state or state pair IDs observed in the epigenomes or pairs of epigenomes, one line per genomic segment
* KLtype is either 0 (for KL), 1 (for KL*), or 2 (for KL**)
KL compares states, KL* compares tallies of state pairs, and KL** compares state pairs of individual epigenome pairs
* NsitesGenomewide is the total number of sites observed genome-wide
* infileQ contains the Q, Q*, or Q** tally matrix (also see below)
* outfileObs will receive genomic coordinates (regions on chromosome "chr" of width regionWidth, starting at firstBegPos),
the state (or state pair) making the largest contribution to the metric,
the magnitude of that contribution, and the total value of the metric.
If two groups are specified (see below), it will also include a column containing +/-1,
specifying whether the first group (+1) or the second (-1) contributes more to the overall metric.
* outfileQcat will be in "qcat" format, uncompressed
* Optional additional argument infileQ2 can be used to specify Q, Q*, or Q** for a 2nd group of epigenomes,
in which case the metric quantifies the difference (distance) between them.
Usage type #2: /scratch/genomic_med/apps/epilogos/bin/computeEpilogosPart2_perChrom infile KLtype NsitesGenomewide infileQ1 infileQ2 outfileNulls
where
* infile contains random permutations of states (or state pairs) observed in the initial input data
* outfileNulls will receive the total difference metric for each line of permuted states (or state pairs)
* the remaining arguments are the same as described above
This second "usage type" is used to generate a null distribution, for estimating significance
of the metric values calculated via "usage type 1."
The other files are empty, except chr1_Pnumerators.txt (60M in size). Any ideas what's wrong?
BTW, do you have an estimate on time and CPU usage for the examples?
Thanks!
Tommy
The arguments that computeEpilogos_singleChromosomeSingleProcessor.sh sends to computeEpilogosPart2_perChrom.cpp need to be the same as the ones that computeEpilogos.sh sends to computeEpilogosPart2_perChrom.cpp.
epilogos/src/computeEpilogosSlurm.py
Line 453 in 80bf1f5
Using a try/except block around subprocess calls can help with catching local issues with running command-line tools outside of Python:
try:
    subprocess.run("scancel {}".format(allJobIDs), shell=True, check=True)
except subprocess.CalledProcessError as e:
    # handle error...
(Note that check=True is needed for subprocess.run to raise CalledProcessError on a non-zero exit status.)
Certain error codes might have to do with the binary not being found, or in the case of scancel, one or more job IDs passed to it might not be found associated with any jobs internally, etc. Better handling of errors can help end users catch odd problems, or help with debugging.
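A fuller sketch of the same idea, using a placeholder command since scancel only exists on Slurm clusters:

```python
import subprocess
import sys

def run_tool(cmd: list) -> int:
    """Run an external command, distinguishing 'not found' from 'failed'."""
    try:
        result = subprocess.run(cmd, check=True, capture_output=True, text=True)
        return result.returncode
    except FileNotFoundError:
        # the binary itself is missing from PATH
        sys.stderr.write("command not found: {}\n".format(cmd[0]))
        return 127
    except subprocess.CalledProcessError as e:
        # the binary ran but exited non-zero
        sys.stderr.write("command failed ({}): {}\n".format(e.returncode, e.stderr))
        return e.returncode

ok = run_tool([sys.executable, "-c", "pass"])             # succeeds
missing = run_tool(["definitely-not-a-real-binary-xyz"])  # triggers FileNotFoundError
```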
epilogos/src/computeEpilogosSlurm.py
Lines 378 to 383 in 80bf1f5
Instead of hardcoding Slurm parameters like partition, memory, cpus-per-task, etc., consider exposing them via click options. These parameters could be set to Altius-specific defaults, to make the kit easier to run locally.
Hi,
I am eager to play around with epilogos, could you please add a bit more README?
Thanks!
Tommy
Hello again, I'm running into another issue that is unfortunately limiting my ability to use epilogos outputs for downstream analysis.
This is 1 example command from our pipeline (I'm using a singularity container, so "singularity exec episead.sif" just runs the rest of the command, but inside the container):
singularity exec episead.sif computeEpilogos_minimal.sh wt/04epilogos_input_by_chr/chr5.txt 13 1 $output_dir "1-9" > wt/05epi_out/chr5/scores.txt.gz
This is effectively:
computeEpilogos_minimal.sh wt/04epilogos_input_by_chr/chr5.txt 13 1 $output_dir "1-9" > wt/05epi_out/chr5/scores.txt.gz
Here is my error message:
*** buffer overflow detected ***: /usr/local/bin/computeEpilogosPart2_perChrom terminated
/usr/local/bin/computeEpilogos_minimal.sh: line 213: 66552 Aborted (core dumped) $EXE1 $infile1 $metric $numStates $outfileP $outfileQ $outfileNsites $groupAspec $groupBspec $outfileRand $outfileQB > ${outdir}/${jobName}.stdout 2> ${outdir}/${jobName}.stderr
Or:
/usr/local/bin/computeEpilogos_minimal.sh: line 301: 451205 Aborted (core dumped) $EXE2 $infile $metric $totalNumSites $infileQ $outfileObserved $outfileScores $chr $infileQB > ${outdir}/${jobName}.stdout 2> ${outdir}/${jobName}.stderr
Expected behavior:
Computing epilogos on a single chromosome bed file
Actual behavior:
How I'm using epilogos:
I am running computeEpilogos_minimal.sh on a per-chromosome basis on an LSF cluster as part of a pipeline. I'm running it like that because we use a pipeline based on an older version of epilogos where it was easier to run per-chromosome, and computeEpilogos_minimal.sh appears to be written to fulfill the same function. The error occurs within the pipeline and when run manually.
Environment:
The test-run recommended in the ReadMe works without error.
Here are lines from a file that's breaking after 4580 lines (the original file is tab-separated):
line number 400:
chr5 400000 401000 5 6 2 5
line number 500:
chr5 500000 501000 11 7 10 10
line number 600:
chr5 600000 601000 12 11 12 12
line number 17600:
chr5 17600000 17601000 8 8 8 8
line number 17700:
chr5 17700000 17701000 8 8 8 8
line number 17800:
chr5 17800000 17801000 4 8 2 4
I don't see any differences in the lines so I don't think the error is because of our input files.
Here are the lines that would be breaking it:
line number 4578:
chr5 4578000 4579000 4 8 8 3
line number 4579:
chr5 4579000 4580000 13 8 8 3
line number 4580:
chr5 4580000 4581000 13 13 13 9
line number 4581:
chr5 4581000 4582000 13 13 13 9
line number 4582:
chr5 4582000 4583000 13 13 4 3
Here are the final lines of the apparently truncated epilogos output file:
chr5 4575000 4576000 0 0 0.813 0.1082 0 0 0 0.1067 0 0 0 0 0
chr5 4576000 4577000 0 0 0.813 0.1082 0 0 0 0.1067 0 0 0 0 0
chr5 4577000 4578000 0 0 0.813 0.1082 0 0 0 0.1067 0 0 0 0 0
chr5 4578000 4579000 0 0 0.813 0.1082 0 0 0 0.1067 0 0 0 0 0
chr5 4579000 4580000 0 0 0.813 0 0 0 0 0.1067 0 0 0 0 0.8291
Any thoughts or feedback would be greatly appreciated. Thank you.
Hi,
After successfully running epilogos on the test data, I tried computeEpilogos_minimal.sh on my own data using the S3 mode, but got only the statePairs.txt for each chromosome; no observations.starch, scores.txt.gz, or exemplarRegions.txt file. There were also no error messages. I checked my input data format and it seems to be exactly the same as the test data in the epilogos package, so I am now really puzzled. Could you please point out what could be wrong? Thanks.
Yan
Hi, I'm trying to use epilogos and am confused about the groupSpec argument. With how I understand the documentation and comments in the epilogos code, it appears that groupSpec is supposed to state which columns should be included in analysis, and that the first sample (from column 4) should be "1" in this argument because it is the first sample. If this is wrong, please let me know.
Based on this understanding, if I have 9 samples, I would expect "1-9" to work for the groupSpec argument, and it does, but I also tried some other inputs to make sure I understood, such as "2-10", which also works and has slightly different outputs to the scores.txt.gz file. Following that trend, "3-11" raises an error:
/usr/local/bin/computeEpilogos_minimal.sh: line 213: 390467 Segmentation fault (core dumped) $EXE1 $infile1 $metric $numStates $outfileP $outfileQ $outfileNsites $groupAspec $groupBspec $outfileRand $outfileQB > ${outdir}/${jobName}.stdout 2> ${outdir}/${jobName}.stderr
But "4-12" works and so does "4-13".
This is from my input file:
chr3 0 1000 8 8 8 8 8 8 8 8 8
chr3 1000 2000 8 8 8 8 8 8 8 8 8
chr3 2000 3000 8 8 8 8 8 8 8 8 8
chr3 3000 4000 8 8 8 8 8 8 8 8 8
chr3 4000 5000 8 8 8 8 8 8 8 8 8
This is the command I run (for "4-12"):
computeEpilogos_minimal.sh 04epilogos_input_by_chr/chr3.txt 13 1 05epilogos_output_by_chr/chr3 "4-12" > 05epilogos_output_by_chr/chr3/scores.txt.gz
Any input on this topic would be appreciated.
Hi,
Could you please give some specification of the output?
Thanks,
Tommy
epilogos/src/columnSplitterAdsera.py
Lines 65 to 67 in 80bf1f5
The print function hides that it prints by default to standard output, and it doesn't work the same between Python 2 and 3. Using sys.stderr.write is explicit and portable, and isn't much more typing.
I would suggest replacing print() with sys.stderr.write('\n') or similar to add newline characters. Newlines could also be added to previous print calls.
If these print commands are used for logging, the logging module might be worth looking into (https://docs.python.org/3/library/logging.html) for more explicit control of where messages go.
epilogos/src/computeEpilogosSlurm.py
Line 30 in 80bf1f5
Instead of using a default string called null
, I would suggest using the Python None
keyword to define a null value.
Later on, comparisons can be made more "Pythonic", e.g.:
epilogos/src/computeEpilogosSlurm.py
Lines 73 to 74 in 80bf1f5
can become:
if not expFreqDir: print("Background Directory = {}".format(outputDirPath))
epilogos/src/computeEpilogosSlurm.py
Lines 83 to 84 in 80bf1f5
can become a simpler fallback, perhaps:
expFreqDir = expFreqDir if expFreqDir else outputDirectory
(or, equivalently, expFreqDir = expFreqDir or outputDirectory).
Having a requirements.txt file at the root level of the project makes installing dependencies as simple as pip install -r requirements.txt.
You can edit this file whenever new dependencies are needed, without having to keep track of the list of packages in the project README.
Another advantage is that you can specify specific or minimum versions of numpy, etc. in this text file. This helps with making a reproducible setup on other systems, in case you need to debug things. Newer versions of numpy etc. can change function parameters and so on, which could make your code dependent on specific versions.
To create this file from a current environment where you know things are working, you can run pip freeze > requirements.txt in the root directory. The file will be a list of all the installed dependencies and their pinned versions.
This could also be generally useful later on for packaging.
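An illustrative requirements.txt (the packages match ones epilogos appears to use, but the version pins here are examples, not the project's tested versions). Note that pip freeze also pins transitive dependencies, so a hand-curated list like this is often shorter and easier to maintain:

```
numpy>=1.19
pandas>=1.1
scipy>=1.5
click>=7.1
matplotlib>=3.3
```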
epilogos/src/computeEpilogosExpectedMaster.py
Lines 87 to 113 in 80bf1f5
Type hints can be passed into documentation and linter scripts to automate writing docs and checking the types of data being passed around, including Python primitives (str, int, etc.) and numpy data:
https://www.python.org/dev/peps/pep-0484/
https://numpy.org/devdocs/reference/typing.html
This can also be useful to read the function at a glance, to see what goes in and out, and help with debugging.
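A small sketch of what annotated code could look like (the function and its parameters are invented for illustration, not taken from computeEpilogosExpectedMaster.py):

```python
from typing import List

def state_counts(states: List[int], num_states: int) -> List[int]:
    """Tally how often each state ID (1-based) occurs in a column of calls."""
    counts: List[int] = [0] * num_states
    for s in states:
        counts[s - 1] += 1
    return counts

# numpy arrays can be annotated too, via numpy.typing, e.g.:
#   def scores(obs: "npt.NDArray[np.int32]") -> "npt.NDArray[np.float64]": ...

tallies = state_counts([8, 8, 4, 13], 15)
```

Tools like mypy can then check call sites automatically, and the signature alone documents what goes in and out.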
Hi,
I want to know why one needs to process by chromosome using computeEpilogos_singleChromosomeSingleProcessor.sh. Is it because the memory usage is too big if all chromosomes are processed at once?
Thanks,
Tommy
When running the minimal, non-Slurm paired test, I get the following numpy-related error:
TypeError: Cannot compare types 'ndarray(dtype=int32)' and 'str'
Details:
(epilogos_052421_v3) areynolds@Earl-Grey epilogos % tmp_dir=$(mktemp -d -t epilogos-paired-XXXXXXXXXX)
(epilogos_052421_v3) areynolds@Earl-Grey epilogos % epilogos -m paired -l -a data/pyData/male/ -b data/pyData/female/ -n data/state_metadata/human/Boix_et_al_833_sample/hg19/18/metadata.tsv -o ${tmp_dir}
d8b 888
Y8P 888
888
.d88b. 88888b. 888 888 .d88b. .d88b. .d88b. .d8888b
d8P Y8b 888 "88b 888 888 d88""88b d88P"88b d88""88b 88K
88888888 888 888 888 888 888 888 888 888 888 888 "Y8888b.
Y8b. 888 d88P 888 888 Y88..88P Y88b 888 Y88..88P X88
"Y8888 88888P" 888 888 "Y88P" "Y88888 "Y88P" 88888P'
888 888
888 Y8b d88P
888 "Y88P"
Input Directory 1 = /Users/areynolds/Developer/Github/epilogos/data/pyData/male
Input Directory 2 = /Users/areynolds/Developer/Github/epilogos/data/pyData/female
State Model = 18
Saliency level = 1
Output Directory = /var/folders/h5/sggy_r7s355gnwssgl9mw44c0000gq/T/epilogos-paired-XXXXXXXXXX.v0iSPsWB
Number of Cores = All available
Quiescent State = 18
STEP 1: Per data file background frequency calculation
epilogos_matrix_chr1 [Done]
STEP 2: Background frequency combination
[Done]
STEP 3: Score calculation
epilogos_matrix_chr1 ......... [Done]
STEP 4: Generating p-values and figures
Fitting distances [Done]
Reading in files [Done]
Calculating p-vals [Done]
Writing metrics [Done]
Benjamini-Hochberg procedure [Done]
Greatest hits txt Traceback (most recent call last):
File "/Users/areynolds/miniconda3/envs/epilogos_052421_v3/bin/epilogos", line 33, in <module>
sys.exit(load_entry_point('epilogos', 'console_scripts', 'epilogos')())
File "/Users/areynolds/miniconda3/envs/epilogos_052421_v3/lib/python3.9/site-packages/click/core.py", line 829, in __call__
return self.main(*args, **kwargs)
File "/Users/areynolds/miniconda3/envs/epilogos_052421_v3/lib/python3.9/site-packages/click/core.py", line 782, in main
rv = self.invoke(ctx)
File "/Users/areynolds/miniconda3/envs/epilogos_052421_v3/lib/python3.9/site-packages/click/core.py", line 1066, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/Users/areynolds/miniconda3/envs/epilogos_052421_v3/lib/python3.9/site-packages/click/core.py", line 610, in invoke
return callback(*args, **kwargs)
File "/Users/areynolds/Developer/Github/epilogos/epilogos/__main__.py", line 64, in main
runEpilogos(mode, commandLineBool, inputDirectory, inputDirectory1, inputDirectory2, outputDirectory, stateInfo, saliency,
File "/Users/areynolds/Developer/Github/epilogos/epilogos/epilogos.py", line 214, in main
pairwiseVisual.main(inputDirPath.name, inputDirPath2.name, stateInfo, outputDirPath, fileTag, numProcesses,
File "/Users/areynolds/Developer/Github/epilogos/epilogos/pairwiseVisual.py", line 106, in main
createTopScoresTxt(roiPath, locationArr, chrDict, distanceArrReal, maxDiffArr, stateNameList, pvals, False, mhPvals)
File "/Users/areynolds/Developer/Github/epilogos/epilogos/pairwiseVisual.py", line 548, in createTopScoresTxt
locations = pd.DataFrame(np.concatenate((locationArr[indices], distanceArr[indices].reshape(len(indices), 1),
File "/Users/areynolds/miniconda3/envs/epilogos_052421_v3/lib/python3.9/site-packages/pandas/core/frame.py", line 4379, in replace
return super().replace(
File "/Users/areynolds/miniconda3/envs/epilogos_052421_v3/lib/python3.9/site-packages/pandas/core/generic.py", line 6500, in replace
return self.replace(
File "/Users/areynolds/miniconda3/envs/epilogos_052421_v3/lib/python3.9/site-packages/pandas/core/frame.py", line 4379, in replace
return super().replace(
File "/Users/areynolds/miniconda3/envs/epilogos_052421_v3/lib/python3.9/site-packages/pandas/core/generic.py", line 6518, in replace
return self._replace_columnwise(mapping, inplace, regex)
File "/Users/areynolds/miniconda3/envs/epilogos_052421_v3/lib/python3.9/site-packages/pandas/core/frame.py", line 4415, in _replace_columnwise
newobj = ser.replace(target, value, regex=regex)
File "/Users/areynolds/miniconda3/envs/epilogos_052421_v3/lib/python3.9/site-packages/pandas/core/series.py", line 4563, in replace
return super().replace(
File "/Users/areynolds/miniconda3/envs/epilogos_052421_v3/lib/python3.9/site-packages/pandas/core/generic.py", line 6543, in replace
new_data = self._mgr.replace_list(
File "/Users/areynolds/miniconda3/envs/epilogos_052421_v3/lib/python3.9/site-packages/pandas/core/internals/managers.py", line 642, in replace_list
masks = [comp(s, mask, regex) for s in src_list]
File "/Users/areynolds/miniconda3/envs/epilogos_052421_v3/lib/python3.9/site-packages/pandas/core/internals/managers.py", line 642, in <listcomp>
masks = [comp(s, mask, regex) for s in src_list]
File "/Users/areynolds/miniconda3/envs/epilogos_052421_v3/lib/python3.9/site-packages/pandas/core/internals/managers.py", line 636, in comp
return _compare_or_regex_search(values, s, regex, mask)
File "/Users/areynolds/miniconda3/envs/epilogos_052421_v3/lib/python3.9/site-packages/pandas/core/internals/managers.py", line 1992, in _compare_or_regex_search
_check_comparison_types(False, a, b)
File "/Users/areynolds/miniconda3/envs/epilogos_052421_v3/lib/python3.9/site-packages/pandas/core/internals/managers.py", line 1971, in _check_comparison_types
raise TypeError(
TypeError: Cannot compare types 'ndarray(dtype=int32)' and 'str'
My run environment:
(epilogos_052421_v3) areynolds@Earl-Grey epilogos % python --version
Python 3.9.4
(epilogos_052421_v3) areynolds@Earl-Grey epilogos % conda list
# packages in environment at /Users/areynolds/miniconda3/envs/epilogos_052421_v3:
#
# Name Version Build Channel
ca-certificates 2020.12.5 h033912b_0 conda-forge
certifi 2020.12.5 py39h6e9494a_1 conda-forge
click 7.1.2 pypi_0 pypi
cycler 0.10.0 pypi_0 pypi
cython 0.29.23 pypi_0 pypi
kiwisolver 1.3.1 pypi_0 pypi
libcxx 11.1.0 habf9029_0 conda-forge
libffi 3.3 h046ec9c_2 conda-forge
matplotlib 3.3.2 pypi_0 pypi
natsort 7.1.1 pypi_0 pypi
ncls 0.0.57 pypi_0 pypi
ncurses 6.2 h2e338ed_4 conda-forge
numpy 1.19.2 pypi_0 pypi
openssl 1.1.1k h0d85af4_0 conda-forge
pandas 1.1.3 pypi_0 pypi
patsy 0.5.1 pypi_0 pypi
pillow 8.2.0 pypi_0 pypi
pip 21.1.2 pyhd8ed1ab_0 conda-forge
pyparsing 2.4.7 pypi_0 pypi
pyranges 0.0.97 pypi_0 pypi
pyrle 0.0.32 pypi_0 pypi
python 3.9.4 h9133fd0_0_cpython conda-forge
python-dateutil 2.8.1 pypi_0 pypi
python_abi 3.9 1_cp39 conda-forge
pytz 2021.1 pypi_0 pypi
readline 8.1 h05e3726_0 conda-forge
scipy 1.5.2 pypi_0 pypi
setuptools 49.6.0 py39h6e9494a_3 conda-forge
six 1.16.0 pypi_0 pypi
sorted-nearest 0.0.33 pypi_0 pypi
sqlite 3.35.5 h44b9ce1_0 conda-forge
statsmodels 0.12.0 pypi_0 pypi
tabulate 0.8.9 pypi_0 pypi
tk 8.6.10 h0419947_1 conda-forge
tzdata 2021a he74cb21_0 conda-forge
wheel 0.36.2 pyhd3deb0d_0 conda-forge
xz 5.2.5 haf1e3a3_1 conda-forge
zlib 1.2.11 h7795811_1010 conda-forge