ecg's People

Contributors

bkaras1, colemathis, hbsmith, louieslocombe, thyamu, vmierzej

Forkers

thyamu bkaras1

ecg's Issues

JGI Requires Chromedriver

This is an unlisted dependency, and there may be other issues.

Chromedriver is hosted here: https://sites.google.com/chromium.org/driver/

I don't like that it ships executables, and doesn't have binaries or installers.

Do we NEED this dependency, or can we find a better, simpler, or more modern solution (e.g., do we need to use Chromedriver at all, or can we use another driver)? Does the version of Chromedriver matter?

Adding the driver is only one of several problems.

Provide the list of all possible arguments

ecg/ecg/jgi.py

Line 10 in a07d0c3

Arguments:

List all possible arguments at the beginning of the script's description, as is already done in the docstring of the function "scrape_domain".

For example, for domain,

Alpha

        'Eukaryota'
        'Bacteria' (wait 45 seconds)
        'Archaea'
        'Plasmids' UNTESTED
        'Viruses' UNTESTED
        'GFragment' (gene fragments) UNTESTED
        ## Alpha2
        '*Microbiome' (metagenome)
        'cell' (metagenome- cell enrichment) UNTESTED
        'sps' (metagenome- single particle sort) UNTESTED
        'Metatranscriptome' UNTESTED 

Implement A la carte updating of JGI data

  • Rename jgi_ko_edit.py to jgi.py and delete the old jgi.py (we'll handle the diffs when we merge)
  • Change function names to reflect the fact that they'll actually be public now (e.g. remove the leading _ or "unsafe"). If a function is expected to be public, just add "safety" caveats to its docstring.
  • Make sure docstrings are clear and updated (use ChatGPT as needed).
  • Finalize documentation of how to use A la carte functionality.
  • Include a working example in the example directory with very detailed instructions, for pulling just EC data and for EC + KO data

Add ability to read in a json of organism_urls from command line

Right now it is impractical to feed more than one organismal_url to the scraper through the CLI. To fix this, we should add a function that parses an input file containing many URLs. I'm thinking of a JSON file, because everything else is stored as JSON right now. This JSON would just be a list: ["url_1", "url_2", ...]
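A minimal sketch of such a parser, assuming the file holds a flat JSON list of URL strings. The name load_organism_urls is hypothetical and not part of ecg. Note that valid JSON requires double-quoted strings.

```python
import json
from pathlib import Path


def load_organism_urls(json_path):
    """Read a JSON file holding a flat list of organism URLs.

    Hypothetical helper, not part of ecg. Assumes the file looks like
    ["url_1", "url_2", ...].
    """
    with Path(json_path).open(encoding="utf-8") as handle:
        urls = json.load(handle)

    # Fail early on anything that is not a list of strings.
    if not isinstance(urls, list) or not all(isinstance(u, str) for u in urls):
        raise ValueError(f"{json_path} must contain a JSON list of URL strings")

    # Drop duplicate URLs while keeping the original order.
    return list(dict.fromkeys(urls))
```

The returned list could then be passed to whatever function currently accepts the organism URLs from the CLI.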

Change CLI from docopt to argparse

docopt is fragile and not very flexible (although it is easy to write and easy to understand). We want to switch to argparse, which provides more robust CLI functionality and is part of the Python standard library.
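As a rough illustration of what the argparse version might look like: the subcommand name, the option names, and the choices below are assumptions made for this sketch, not the current ecg CLI.

```python
import argparse


def build_parser():
    """Build a sketch of an argparse-based CLI (all names are hypothetical)."""
    parser = argparse.ArgumentParser(
        prog="ecg", description="Sketch of an argparse-based ecg CLI"
    )
    subparsers = parser.add_subparsers(dest="command", required=True)

    # Hypothetical "jgi" subcommand for scraping JGI data.
    jgi = subparsers.add_parser("jgi", help="Scrape JGI data")
    jgi.add_argument("path", help="Directory where scraped data is written")
    jgi.add_argument("--domain", default=None, help="JGI domain to scrape")
    jgi.add_argument("--database", choices=["all", "jgi"], default="all")
    # nargs="+" accepts one or more URLs after the flag.
    jgi.add_argument("--organism-urls", nargs="+", default=[])
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args)
```

Unlike docopt, argparse validates choices and generates --help output from these declarations, so the usage string no longer has to be kept in sync by hand.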

add element info during/after KEGG retrieval

  • Add fields for:
    left_elements
    right_elements
    element_conservation
    to the master.json file during KEGG retrieval.

  • Add above fields after KEGG retrieval as well.

  • Add element field to the compound.json too?
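A rough sketch of how these fields might be derived, assuming each side of a reaction is available as a list of compound formula strings. The helper names are hypothetical, and "element_conservation" is interpreted here as "both sides contain the same set of elements", which may not be the intended definition.

```python
import re

# Matches element symbols: an uppercase letter plus an optional lowercase one.
ELEMENT_PATTERN = re.compile(r"[A-Z][a-z]?")


def elements_in(formulas):
    """Return the sorted set of element symbols found in a list of formulas."""
    elements = set()
    for formula in formulas:
        elements.update(ELEMENT_PATTERN.findall(formula))
    return sorted(elements)


def element_fields(left_formulas, right_formulas):
    """Build the three proposed fields for one reaction (hypothetical helper)."""
    left = elements_in(left_formulas)
    right = elements_in(right_formulas)
    return {
        "left_elements": left,
        "right_elements": right,
        # Assumption: conserved means the same elements appear on both sides.
        "element_conservation": left == right,
    }
```

For example, element_fields(["C6H12O6", "O2"], ["CO2", "H2O"]) would report C, H, and O on both sides. The simple pattern does not handle generic groups such as R, so real KEGG formulas may need extra cleaning first.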

add to README.md: simplified stepwise installation instructions

Brand-new users might not know that dependencies aren't automatically installed, and might not feel confident choosing from a list of local installation options... all of which assume both (1) a preferred and pre-existing directory structure, and (2) successful execution of unstated installation steps. It could be smart to add a "quick setup" section for novice users near the top of the README. e.g....

Open your terminal or other Unix/Linux command-line interface. Use it to navigate to your desktop, documents, or other folder in which you tend to store projects (e.g. cd Desktop/). Then, copy+paste into the terminal each of the following lines:

mkdir ecgHub
pip install docopt; pip install tqdm; pip install biopython; pip install selenium; pip install beautifulsoup4; pip install networkx
cd ecgHub
git clone https://github.com/ELIFE-ASU/ecg
cd ecg
pip install -e .
mkdir mydata

The command import ecg should now work for any Python scripts or Jupyter Notebooks created and stored in the top-level ecg directory (i.e. ecgHub/ecg). Files not used by ecg or generated with ecg, but which are relevant or occasionally needed in scripts which import ecg, can then be stored in the ecgHub folder. (manuscripts, notes, templates, auxiliary csvs, etc.)

Add method to update data from a JGI domain

for example...

Jgi.update_domain(self, path, domain, database='all', 
                                 assembly_types = ['assembled','unassembled','both'])

...would cause an existing directory (as specified by path) to be updated with all domain organisms which are part of the JGI domain, but not currently stored in the directory. The method should check to make sure the directory of the path matches the directory of the domain as a weak way to verify the user isn't trying to update a directory from a different domain. Of course, the user could still update the directory with organisms from a different database (eg jgi instead of all) than what it was originally, or with different assembly types.
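The core of such a method could be sketched as follows. This assumes each organism is stored as one JSON file named after its ID, and that the directory check compares the last path component to the domain name. Both are assumptions, and missing_organisms is a hypothetical helper rather than existing ecg code.

```python
from pathlib import Path


def missing_organisms(path, domain, domain_organism_ids):
    """Return the IDs in the domain that are not yet stored under path.

    Hypothetical helper. Assumes files are named <organism_id>.json.
    """
    directory = Path(path)

    # Weak sanity check: the directory name should match the domain name.
    if directory.name.lower() != domain.lower():
        raise ValueError(
            f"Directory '{directory.name}' does not match domain '{domain}'"
        )

    # IDs already on disk, taken from the file names without the extension.
    stored = {p.stem for p in directory.glob("*.json")}
    return [oid for oid in domain_organism_ids if oid not in stored]
```

An update_domain method could then scrape only the IDs this returns, leaving existing files untouched.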

raising warning before scraping causes multiple tqdm bars in jupyter

The check within Jgi.scrape_domain()

if domain in untested:
            warnings.warn("This domain is untested for this function.")

causes a new tqdm progress bar to appear in Jupyter on every iteration of the loop used to scrape organisms. Suggestions for preventing this behavior are needed.
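One possible approach, assuming the check currently runs inside the scraping loop, is to emit the warning once before the loop starts. The function name and the plain loop below are illustrative only. In the real method the loop would be wrapped in tqdm.

```python
import warnings

# Domains marked UNTESTED in the scrape_domain argument list above.
UNTESTED = {"Plasmids", "Viruses", "GFragment", "cell", "sps", "Metatranscriptome"}


def scrape_with_single_warning(domain, organism_urls):
    """Warn once up front, then iterate (hypothetical stand-in for the scraper)."""
    # Emit the warning before the loop so nothing is written to stderr
    # while the progress bar is being drawn.
    if domain in UNTESTED:
        warnings.warn("This domain is untested for this function.")

    scraped = []
    for url in organism_urls:  # in the real code: for url in tqdm(organism_urls)
        scraped.append(url)  # placeholder for the actual scraping step
    return scraped
```

If the warning has to stay inside the loop, tqdm.write is the documented way to print a message without breaking the bar, although it bypasses the warnings machinery.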

default for the database

ecg/ecg/jgi.py

Line 293 in a07d0c3

def scrape_domain(self, path, domain, database='all',

The default value of the database argument is 'all', but the function's docstring says the default is 'jgi'.

Check if an EC number was changed/updated

Add a function to check whether an EC number was changed in a KEGG update. There may be a field in KEGG that lists equivalent EC numbers; otherwise we can do this manually, because two different EC numbers should not catalyze exactly the same set of reactions.
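The manual check could be sketched like this, assuming a mapping from EC number to the reaction IDs it catalyzes is already available. The function name and the input shape are hypothetical.

```python
from collections import defaultdict


def find_equivalent_ecs(ec_to_reactions):
    """Group EC numbers that catalyze exactly the same set of reactions.

    Hypothetical helper. Expects {ec_number: iterable of reaction IDs}.
    """
    groups = defaultdict(list)
    for ec, reactions in ec_to_reactions.items():
        # Skip ECs with no reactions so they are not all lumped together.
        if reactions:
            groups[frozenset(reactions)].append(ec)

    # Only sets shared by more than one EC suggest a renamed or merged entry.
    return [sorted(ecs) for ecs in groups.values() if len(ecs) > 1]
```

Each returned group is a candidate pair (or larger set) of old and new EC numbers that would still need to be confirmed against KEGG.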
