elife-asu / ecg
Pulling information from biological databases, and converting it into easy-to-use GMLs for network science.
License: MIT License
This is an unlisted dependency, and there may be other issues.
Chromedriver is hosted here: https://sites.google.com/chromium.org/driver/
I don't like that it ships executables, and doesn't have binaries or installers.
Do we NEED this dependency, or can we find a better/simpler/more modern solution? (e.g. do we need to use chromedriver at all, or can we use another driver?) Does the version of Chromedriver matter?
Adding the driver is only one of multiple problems.
Line 10 in a07d0c3
Write the list of all possible arguments at the beginning of the description for the script, as is done under the function "scrape_domain".
For example, for domain,
'Eukaryota'
'Bacteria' (wait 45 seconds)
'Archaea'
'Plasmids' UNTESTED
'Viruses' UNTESTED
'GFragment' (gene fragments) UNTESTED
## Alpha2
'*Microbiome' (metagenome)
'cell' (metagenome- cell enrichment) UNTESTED
'sps' (metagenome- single particle sort) UNTESTED
'Metatranscriptome' UNTESTED
Rename jgi_ko_edit.py to jgi.py, and delete the old jgi.py (we'll handle the diffs when we merge).
or unsafe. If it's expected to be public, just add "safety" caveats to docstrings.
Add an example directory with very detailed instructions, for pulling just EC data and for EC + KO data.
Bio
tqdm
docopt
Add a function to only scrape (meta)genomes whose modified date differs from the modified date recorded in the user's local database.
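A minimal sketch of such a check, assuming (hypothetically) that the local database is a JSON file keyed by taxon ID and that both records carry a 'Modified Date' field; the real schema may differ:

```python
import json

def needs_update(remote_entry, local_db_path):
    """Return True if the remote (meta)genome's modified date differs
    from the one stored in the user's local database.

    Sketch only: 'Taxon ID' and 'Modified Date' are assumed key names.
    """
    with open(local_db_path) as f:
        local_db = json.load(f)
    local_entry = local_db.get(remote_entry["Taxon ID"])
    if local_entry is None:
        return True  # never scraped before, so scrape it
    return remote_entry["Modified Date"] != local_entry["Modified Date"]
```

Organisms for which this returns True would then be the only ones scraped.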
Line 71 in 8bc8906
There are files where there are no "enzymes".
Right now it is impractical to feed in more than one organismal_url for scraping through the CLI. In order to fix this, we should add a function that can parse an input file that contains many URLs. I'm thinking a JSON file, because everything is JSON right now. This JSON would just be a list ['url_1','url_2',...]
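The parser described above could look like this (function name is a placeholder; the expected file format is the plain JSON array from the issue text):

```python
import json

def load_organism_urls(path):
    """Parse a JSON file containing a list of organism URLs.

    Expected file contents (a plain JSON array):
        ["url_1", "url_2", ...]
    """
    with open(path) as f:
        urls = json.load(f)
    # Fail loudly on anything that isn't a flat list of strings.
    if not isinstance(urls, list) or not all(isinstance(u, str) for u in urls):
        raise ValueError("Expected a JSON list of URL strings")
    return urls
```

The CLI could then loop over the returned list instead of taking a single URL.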
Consolidate the entire CLI into script/ecg.py to make the commands more straightforward.
To convert EC lists to graphs
docopt is fragile and not very flexible (although it's easy to write and easy to understand). We want to switch to using argparse, which provides more robust CLI functionality and is a Python standard-library package.
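A minimal argparse sketch of what the switch could look like. The flag and argument names here are hypothetical stand-ins for the real CLI; note that argparse also supports multiple option strings, which covers the short-alias request elsewhere in this tracker:

```python
import argparse

def build_parser():
    # Hypothetical subset of the ecg CLI, shown only to compare against docopt.
    parser = argparse.ArgumentParser(
        prog="ecg",
        description="Scrape biological databases and build networks.")
    parser.add_argument("domain", help="e.g. Eukaryota, Bacteria, Archaea")
    parser.add_argument("--database", default="all", choices=["all", "jgi"])
    # Two option strings give --rp as a short alias for --run_pipeline.
    parser.add_argument("--run_pipeline", "--rp", action="store_true",
                        dest="run_pipeline")
    return parser
```

Unlike docopt, invalid choices and unknown flags produce clear errors and an auto-generated --help for free.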
Add fields for:
left_elements
right_elements
element_conservation
to the master.json file during KEGG retrieval. Add the above fields after KEGG retrieval as well.
Add an element field to compound.json too?
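One way the proposed fields could be derived, as a sketch: it assumes each side of a reaction is available as a list of chemical formula strings (the formula source, and treating element_conservation as a set comparison, are both assumptions):

```python
import re

def element_fields(left_formulas, right_formulas):
    """Build the proposed left_elements / right_elements /
    element_conservation fields from formula strings like ["C6H12O6", "O2"].
    """
    def elements(formulas):
        found = set()
        for formula in formulas:
            # An element symbol is one capital letter plus optional lowercase.
            found.update(re.findall(r"[A-Z][a-z]?", formula))
        return sorted(found)

    left = elements(left_formulas)
    right = elements(right_formulas)
    return {
        "left_elements": left,
        "right_elements": right,
        # Conserved if the same element set appears on both sides.
        "element_conservation": left == right,
    }
```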
@bkaras1 noticed an issue with metagenome scraping https://elife-asu.slack.com/archives/CHLUF07D1/p1574279358002100
Let's fix it.
Brand-new users might not know that dependencies aren't automatically installed, and might not feel confident choosing from a list of local installation options... all of which assume both (1) a preferred and pre-existing directory structure, and (2) successful execution of unstated installation steps. It could be smart to add a "quick setup" section for novice users near the top of the README. e.g....
Open your terminal or other Unix/Linux command-line interface. Use it to navigate to your desktop, documents, or other folder in which you tend to store projects (e.g. cd Desktop/). Then, copy and paste each of the following lines into the terminal:
mkdir ecgHub
pip install docopt; pip install tqdm; pip install biopython; pip install selenium; pip install beautifulsoup4; pip install networkx
cd ecgHub
git clone https://github.com/ELIFE-ASU/ecg
cd ecg
pip install -e .
mkdir mydata
The command import ecg should now work for any Python scripts or Jupyter notebooks created and stored in the top-level ecg directory (i.e. ecgHub/ecg). Files not used by ecg or generated with ecg, but which are relevant or occasionally needed in scripts which import ecg, can then be stored in the ecgHub folder (manuscripts, notes, templates, auxiliary CSVs, etc.).
Line 122 in d014942
self.homepage_url is saved as the string "https://img.jgi.doe.gov/cgi-bin/m/main.cgi" with the quotation marks included as the default homepage URL. This gives an error message. A temporary fix would be to replace it with the line below, though it hard-codes the homepage URL.
full_url = "https://img.jgi.doe.gov/cgi-bin/m/main.cgi{}".format(html_suffix)
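An alternative that avoids hard-coding would be to strip the stray quotation marks before building the URL. A sketch (helper names are hypothetical):

```python
def normalize_homepage_url(raw_url):
    """Strip stray surrounding quotation marks from a stored homepage URL,
    so a default saved as '"https://img.jgi.doe.gov/cgi-bin/m/main.cgi"'
    still works without hard-coding the URL."""
    return raw_url.strip("\"'")

def build_full_url(homepage_url, html_suffix):
    # Equivalent to the temporary fix above, but keeps the URL configurable.
    return "{}{}".format(normalize_homepage_url(homepage_url), html_suffix)
```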
When update is called, the "update" field should be updated in both version.json and master.json. But it should not include the field "lists" and the corresponding full lists of entries from the current lists. Right now it is doing that, and that needs to be removed.
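Excluding the field could be as simple as copying the record without it. A sketch, assuming the record is a plain dict with a "lists" key as described:

```python
def fields_for_update(record):
    """Copy a version record while dropping the 'lists' field (and its
    full lists of entries), so the update metadata written to
    version.json / master.json stays small. Record shape is hypothetical."""
    return {k: v for k, v in record.items() if k != "lists"}
```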
If domain='cell', we need to reference 'Taxon Object ID' instead of 'Taxon ID'.
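A small helper could centralize the key choice instead of branching at every lookup site (a sketch; the helper name is hypothetical):

```python
def taxon_id_key(domain):
    # Metagenome cell-enrichment tables label the ID column differently.
    return "Taxon Object ID" if domain == "cell" else "Taxon ID"
```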
For example, the progress bar is for enzymes, pathways, or other categories.
for example...
Jgi.update_domain(self, path, domain, database='all',
                  assembly_types=['assembled', 'unassembled', 'both'])
...would cause an existing directory (as specified by path) to be updated with all organisms which are part of the JGI domain but not currently stored in the directory. The method should check that the directory of the path matches the directory of the domain, as a weak way to verify the user isn't trying to update a directory from a different domain. Of course, the user could still update the directory with organisms from a different database (e.g. jgi instead of all) than what it was originally built with, or with different assembly types.
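The weak directory check described above could be sketched like this, assuming (as a convention, not confirmed by the source) that the directory is named after its domain:

```python
import os

def check_directory_matches_domain(path, domain):
    """Weak guard for Jgi.update_domain: the directory name must match the
    domain being updated, so a user can't accidentally update a directory
    built from a different domain. Assumes path ends in .../<domain>."""
    if os.path.basename(os.path.normpath(path)) != domain:
        raise ValueError(
            "Directory {!r} does not match domain {!r}".format(path, domain))
```

As the issue notes, this cannot catch mismatched database or assembly_types choices, only a wrong domain.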
Doesn't have outer level "current", "updates", and "original" after updating.
Expected to have all three with correct fields and values.
The check within Jgi.scrape_domain()
if domain in untested:
warnings.warn("This domain is untested for this function.")
causes a new tqdm progress bar to appear in Jupyter on every iteration of the loop used to scrape organisms. Suggestions for preventing this behavior are needed.
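One possible fix, sketched below: emit the warning once before the per-organism loop rather than inside it, so tqdm's single-bar display is never interrupted (function body is a placeholder, not the real scraper):

```python
import warnings

UNTESTED = ("Plasmids", "Viruses", "GFragment")

def scrape_domain_sketch(domain, organisms, untested=UNTESTED):
    # Warn once, before the loop; a warning printed inside the loop forces
    # Jupyter to start a fresh tqdm bar on every iteration.
    if domain in untested:
        warnings.warn("This domain is untested for this function.")
    results = []
    for organism in organisms:  # tqdm(organisms) in the real code
        results.append(organism)  # placeholder for the real per-organism scrape
    return results
```

Alternatively, warnings.simplefilter("once") would suppress duplicate warnings even if the check stays inside the loop.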
For example, make both --run_pipeline and --rp available as aliases for --run_pipeline.
Line 293 in a07d0c3
The default argument for database is set to 'all', but the documentation under the function says the default is 'jgi'.
Add a function to check whether an EC number was changed in a KEGG update. There could be a field in KEGG that lists equivalent EC numbers, or we can do this manually, because there should not be two different ECs that both have the same set of reactions that they catalyze.
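The manual route could be sketched as follows: from locally stored KEGG data, flag EC pairs whose reaction sets are identical, since that would suggest one is a renumbering of the other (input shape is an assumption):

```python
def equivalent_ecs(ec_to_reactions):
    """Find pairs of EC numbers sharing an identical non-empty reaction set,
    which would suggest a renumbering in a KEGG update.

    ec_to_reactions: dict mapping EC number -> set of reaction IDs,
    built from the user's local KEGG data (hypothetical shape).
    """
    pairs = []
    ecs = sorted(ec_to_reactions)
    for i, a in enumerate(ecs):
        for b in ecs[i + 1:]:
            if ec_to_reactions[a] and ec_to_reactions[a] == ec_to_reactions[b]:
                pairs.append((a, b))
    return pairs
```

Any pair returned here would then be checked against KEGG's own record of transferred/obsolete entries.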