elife-asu / ecg
Pulling information from biological databases, and converting it into easy-to-use GMLs for network science.
License: MIT License
This is an unlisted dependency, and there may be other issues.
Chromedriver is hosted here: https://sites.google.com/chromium.org/driver/
I don't like that it ships executables, and doesn't have binaries or installers.
Do we NEED this dependency, or can we find a better/simpler/more modern solution? (e.g. do we need to use chromedriver at all, or can we use another driver?) Does the version of Chromedriver matter?
Adding the driver is only one of multiple problems.
Line 10 in a07d0c3
Write the list of all possible arguments at the beginning of the description for the script, as is done under the function "scrape_domain".
For example, for domain,
'Eukaryota'
'Bacteria' (wait 45 seconds)
'Archaea'
'Plasmids' UNTESTED
'Viruses' UNTESTED
'GFragment' (gene fragments) UNTESTED
## Alpha2
'*Microbiome' (metagenome)
'cell' (metagenome- cell enrichment) UNTESTED
'sps' (metagenome- single particle sort) UNTESTED
'Metatranscriptome' UNTESTED
Rename jgi_ko_edit.py to jgi.py, and delete the old jgi.py (we'll handle the diffs when we merge).
or unsafe. If it's expected to be public, just add "safety" caveats to docstrings.
Add an example directory with very detailed instructions, for pulling just EC data and for EC + KO data.
Bio
tqdm
docopt
Add a function to only scrape (meta)genomes whose modified date differs from the modified date recorded in the user's local database.
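A minimal sketch of such a check, assuming (hypothetically) that the local database is a JSON file keyed by taxon ID and that both records carry a 'Modified Date' field; the real schema may differ:

```python
import json

def needs_update(remote_entry, local_db_path):
    """Return True if the remote (meta)genome's modified date differs
    from the one stored in the user's local database.

    Sketch only: 'Taxon ID' and 'Modified Date' are assumed key names.
    """
    with open(local_db_path) as f:
        local_db = json.load(f)
    local_entry = local_db.get(remote_entry["Taxon ID"])
    if local_entry is None:
        return True  # never scraped before, so scrape it
    return remote_entry["Modified Date"] != local_entry["Modified Date"]
```

Organisms for which this returns True would then be the only ones scraped.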
Line 71 in 8bc8906
There are files where there are no "enzymes".
Right now it is impractical to feed in more than one organismal_url for scraping through the CLI. In order to fix this, we should add a function that can parse an input file that contains many URLs. I'm thinking a JSON file, because everything is JSON right now. This JSON would just be a list ['url_1','url_2',...]
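The parser described above could look like this (function name is a placeholder; the expected file format is the plain JSON array from the issue text):

```python
import json

def load_organism_urls(path):
    """Parse a JSON file containing a list of organism URLs.

    Expected file contents (a plain JSON array):
        ["url_1", "url_2", ...]
    """
    with open(path) as f:
        urls = json.load(f)
    # Fail loudly on anything that isn't a flat list of strings.
    if not isinstance(urls, list) or not all(isinstance(u, str) for u in urls):
        raise ValueError("Expected a JSON list of URL strings")
    return urls
```

The CLI could then loop over the returned list instead of taking a single URL.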
Consolidate the entire CLI into script/ecg.py to make the commands more straightforward.
To convert EC lists to graphs
docopt is fragile and not very flexible (although it's easy to write and easy to understand). We want to switch to using argparse, which provides more robust CLI functionality and is a Python standard-library package.
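A minimal argparse sketch of what the switch could look like. The flag and argument names here are hypothetical stand-ins for the real CLI; note that argparse also supports multiple option strings, which covers the short-alias request elsewhere in this tracker:

```python
import argparse

def build_parser():
    # Hypothetical subset of the ecg CLI, shown only to compare against docopt.
    parser = argparse.ArgumentParser(
        prog="ecg",
        description="Scrape biological databases and build networks.")
    parser.add_argument("domain", help="e.g. Eukaryota, Bacteria, Archaea")
    parser.add_argument("--database", default="all", choices=["all", "jgi"])
    # Two option strings give --rp as a short alias for --run_pipeline.
    parser.add_argument("--run_pipeline", "--rp", action="store_true",
                        dest="run_pipeline")
    return parser
```

Unlike docopt, invalid choices and unknown flags produce clear errors and an auto-generated --help for free.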
Add fields for:
left_elements
right_elements
element_conservation
to the master.json file during KEGG retrieval. Add the above fields after KEGG retrieval as well.
Add an element field to compound.json too?
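One way the proposed fields could be derived, as a sketch: it assumes each side of a reaction is available as a list of chemical formula strings (the formula source, and treating element_conservation as a set comparison, are both assumptions):

```python
import re

def element_fields(left_formulas, right_formulas):
    """Build the proposed left_elements / right_elements /
    element_conservation fields from formula strings like ["C6H12O6", "O2"].
    """
    def elements(formulas):
        found = set()
        for formula in formulas:
            # An element symbol is one capital letter plus optional lowercase.
            found.update(re.findall(r"[A-Z][a-z]?", formula))
        return sorted(found)

    left = elements(left_formulas)
    right = elements(right_formulas)
    return {
        "left_elements": left,
        "right_elements": right,
        # Conserved if the same element set appears on both sides.
        "element_conservation": left == right,
    }
```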
@bkaras1 noticed an issue with metagenome scraping https://elife-asu.slack.com/archives/CHLUF07D1/p1574279358002100
Let's fix it.
Brand-new users might not know that dependencies aren't automatically installed, and might not feel confident choosing from a list of local installation options... all of which assume both (1) a preferred and pre-existing directory structure, and (2) successful execution of unstated installation steps. It could be smart to add a "quick setup" section for novice users near the top of the README. e.g....
Open your terminal or other Unix/Linux command-line interface. Use it to navigate to your desktop, documents, or other folder in which you tend to store projects (e.g. cd Desktop/). Then, copy and paste each of the following lines into the terminal:
mkdir ecgHub
pip install docopt; pip install tqdm; pip install biopython; pip install selenium; pip install beautifulsoup4; pip install networkx
cd ecgHub
git clone https://github.com/ELIFE-ASU/ecg
cd ecg
pip install -e .
mkdir mydata
The command import ecg should now work for any Python scripts or Jupyter notebooks created and stored in the top-level ecg directory (i.e. ecgHub/ecg). Files not used by ecg or generated with ecg, but which are relevant or occasionally needed in scripts which import ecg, can then be stored in the ecgHub folder (manuscripts, notes, templates, auxiliary CSVs, etc.).
Line 122 in d014942
self.homepage_url is saved as the string "https://img.jgi.doe.gov/cgi-bin/m/main.cgi" with the quotation marks included as the default homepage URL. This gives an error message. A temporary fix would be to replace it with the line below, though it hard-codes the homepage URL.
full_url = "https://img.jgi.doe.gov/cgi-bin/m/main.cgi{}".format(html_suffix)
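An alternative that avoids hard-coding would be to strip the stray quotation marks before building the URL. A sketch (helper names are hypothetical):

```python
def normalize_homepage_url(raw_url):
    """Strip stray surrounding quotation marks from a stored homepage URL,
    so a default saved as '"https://img.jgi.doe.gov/cgi-bin/m/main.cgi"'
    still works without hard-coding the URL."""
    return raw_url.strip("\"'")

def build_full_url(homepage_url, html_suffix):
    # Equivalent to the temporary fix above, but keeps the URL configurable.
    return "{}{}".format(normalize_homepage_url(homepage_url), html_suffix)
```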
When update is called, the "update" field should be updated in both version.json and master.json. But it should not include the field "lists" and the corresponding full lists of entries from the current lists. Right now it is doing that, and that needs to be removed.
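Excluding the field could be as simple as copying the record without it. A sketch, assuming the record is a plain dict with a "lists" key as described:

```python
def fields_for_update(record):
    """Copy a version record while dropping the 'lists' field (and its
    full lists of entries), so the update metadata written to
    version.json / master.json stays small. Record shape is hypothetical."""
    return {k: v for k, v in record.items() if k != "lists"}
```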
If domain='cell', we need to reference 'Taxon Object ID' instead of 'Taxon ID'.
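A small helper could centralize the key choice instead of branching at every lookup site (a sketch; the helper name is hypothetical):

```python
def taxon_id_key(domain):
    # Metagenome cell-enrichment tables label the ID column differently.
    return "Taxon Object ID" if domain == "cell" else "Taxon ID"
```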
For example, the progress bar is for enzymes, pathways, or other categories.
for example...
Jgi.update_domain(self, path, domain, database='all',
                  assembly_types=['assembled', 'unassembled', 'both'])
...would cause an existing directory (as specified by path) to be updated with all organisms which are part of the JGI domain but not currently stored in the directory. The method should check that the directory of the path matches the directory of the domain, as a weak way to verify the user isn't trying to update a directory from a different domain. Of course, the user could still update the directory with organisms from a different database (e.g. jgi instead of all) than what it was originally built with, or with different assembly types.
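The weak directory check described above could be sketched like this, assuming (as a convention, not confirmed by the source) that the directory is named after its domain:

```python
import os

def check_directory_matches_domain(path, domain):
    """Weak guard for Jgi.update_domain: the directory name must match the
    domain being updated, so a user can't accidentally update a directory
    built from a different domain. Assumes path ends in .../<domain>."""
    if os.path.basename(os.path.normpath(path)) != domain:
        raise ValueError(
            "Directory {!r} does not match domain {!r}".format(path, domain))
```

As the issue notes, this cannot catch mismatched database or assembly_types choices, only a wrong domain.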
Doesn't have outer level "current", "updates", and "original" after updating.
Expected to have all three with correct fields and values.
The check within Jgi.scrape_domain()
if domain in untested:
warnings.warn("This domain is untested for this function.")
causes a new tqdm progress bar to appear in Jupyter on every iteration of the loop used to scrape organisms. Suggestions for preventing this behavior are needed.
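One possible fix, sketched below: emit the warning once before the per-organism loop rather than inside it, so tqdm's single-bar display is never interrupted (function body is a placeholder, not the real scraper):

```python
import warnings

UNTESTED = ("Plasmids", "Viruses", "GFragment")

def scrape_domain_sketch(domain, organisms, untested=UNTESTED):
    # Warn once, before the loop; a warning printed inside the loop forces
    # Jupyter to start a fresh tqdm bar on every iteration.
    if domain in untested:
        warnings.warn("This domain is untested for this function.")
    results = []
    for organism in organisms:  # tqdm(organisms) in the real code
        results.append(organism)  # placeholder for the real per-organism scrape
    return results
```

Alternatively, warnings.simplefilter("once") would suppress duplicate warnings even if the check stays inside the loop.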
For example, make both --run_pipeline and --rp available as aliases for --run_pipeline.
Line 293 in a07d0c3
The default argument for database is set to 'all', but the documentation under the function says the default is 'jgi'.
Add a function to check whether an EC number was changed in a KEGG update. There could be a field in KEGG that lists equivalent EC numbers, or we can do this manually, because there should not be two different ECs that both have the same set of reactions that they catalyze.
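The manual route could be sketched as follows: from locally stored KEGG data, flag EC pairs whose reaction sets are identical, since that would suggest one is a renumbering of the other (input shape is an assumption):

```python
def equivalent_ecs(ec_to_reactions):
    """Find pairs of EC numbers sharing an identical non-empty reaction set,
    which would suggest a renumbering in a KEGG update.

    ec_to_reactions: dict mapping EC number -> set of reaction IDs,
    built from the user's local KEGG data (hypothetical shape).
    """
    pairs = []
    ecs = sorted(ec_to_reactions)
    for i, a in enumerate(ecs):
        for b in ecs[i + 1:]:
            if ec_to_reactions[a] and ec_to_reactions[a] == ec_to_reactions[b]:
                pairs.append((a, b))
    return pairs
```

Any pair returned here would then be checked against KEGG's own record of transferred/obsolete entries.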