geopephub's Introduction

geopephub

Automatic uploader of GEO metadata projects to PEPhub.

This repository contains the geopephub CLI, which uploads GEO projects to PEPhub automatically, either selected by date or on a schedule using GitHub Actions. The CLI also includes a download command that retrieves projects from a specified namespace directly from the PEPhub database. This is particularly helpful for downloading all GEO projects at once.

Installation

To install geopephub use this command:

pip install git+https://github.com/pepkit/geopephub.git

Overview:

geopephub consists of 4 main components:

  1. Queuer: This module comprises functions that scan for new projects in GEO, generate a new cycle for the current run, and log details for each GEO project. It sets the project status to queued and adds it to the database.
  2. Uploader: Checks whether there are any queued cycles in the cycle_status table. It retrieves the list of queued projects, runs geofetch to download them, and uploads the results to the PEPhub database using pepdbagent. geopephub updates the project upload status at each step, so a failed upload can be inspected later to see where and why it failed.
  3. Checker: This component examines previous cycles, verifies their status, and determines if they were executed. If a cycle was not executed or was unsuccessful, it triggers a rerun. In cases where only one project was unsuccessful, it attempts to upload it again. Additionally, if the cycle does not exist, it creates one using the queuer and uploads files using the uploader.
  4. Downloader: Retrieves projects from the specified namespace, filters them by upload or update date, and optionally sorts them by name or date. It also allows setting a limit on the number of downloaded projects. Projects can be downloaded locally or to a specified S3 bucket. For more information, use the geopephub --help command.
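
The Downloader's filter, sort, and limit steps can be illustrated with a minimal sketch. This is not the geopephub implementation: ProjectRecord, select_projects, and their parameters are hypothetical names chosen for illustration, and the real CLI options may differ (check geopephub --help).

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class ProjectRecord:
    """Hypothetical stand-in for one project entry in a namespace."""
    name: str
    last_update: date


def select_projects(
    projects: List[ProjectRecord],
    updated_after: Optional[date] = None,
    sort_by: Optional[str] = None,  # assumed values: "name" or "date"
    limit: Optional[int] = None,
) -> List[ProjectRecord]:
    """Filter by update date, optionally sort, then cap the number of results."""
    selected = [
        p for p in projects
        if updated_after is None or p.last_update >= updated_after
    ]
    if sort_by == "name":
        selected.sort(key=lambda p: p.name)
    elif sort_by == "date":
        selected.sort(key=lambda p: p.last_update)
    # The limit is applied last, so it respects the chosen ordering.
    return selected if limit is None else selected[:limit]
```

Applying the limit after sorting means a call such as select_projects(records, sort_by="date", limit=10) returns the ten oldest matching records rather than an arbitrary ten.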

More information about these processes can be found in the flowcharts and overview below.

Queuer Flowchart:

%%{init: {'theme':'forest'}}%%
stateDiagram-v2
    s1 --> s2 
    s2 --> s3
    s3 --> s4
    s4 --> s5
    s1: Create a new cycle
    s2: Find GEO updated projects with geofetch Finder
    s3: Add projects to the queue in sample status table
    s4: Change cycle status to queued
    s5: Exit
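
The Queuer steps above can be sketched in a few lines of Python. This is an in-memory illustration only: the Cycle class, the run_queuer function, and the find_updated callback (a stand-in for the geofetch Finder) are hypothetical names, and the real module writes to database tables rather than to a dataclass.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Cycle:
    """Hypothetical in-memory stand-in for one row of the cycle status table."""
    namespace: str
    status: str = "initial"
    project_status: Dict[str, str] = field(default_factory=dict)


def run_queuer(namespace: str, find_updated: Callable[[], List[str]]) -> Cycle:
    cycle = Cycle(namespace=namespace)        # s1: create a new cycle
    accessions = find_updated()               # s2: find updated GEO projects
    for gse in accessions:                    # s3: add each project to the queue
        cycle.project_status[gse] = "queued"
    cycle.status = "queued"                   # s4: change cycle status to queued
    return cycle                              # s5: exit
```

For example, run_queuer("geo", lambda: ["GSE1", "GSE2"]) returns a cycle whose status and both project statuses are "queued".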

Uploader Flowchart:

%%{init: {'theme':'forest'}}%%
stateDiagram-v2
    s1 --> s2 
    s2 --> s3
    s3 --> s4
    s4 --> s5
    s5 --> s6
    s6 --> s7
    s7 --> s8

    s7 --> s2
    s6 --> s3

    s1: Get queued cycles by specifying namespace
    s2: Change status of the cycle
    s2: Get each element from list of queued cycle
    s3: Get each project (GSE) from one cycle
    s4: Change status of the project in project_status_table
    s5: Get specified project by running Geofetcher
    s6: Using pepdbagent add project to the DB
    s6: Change status of the project in project_status_table
    s7: Change status of cycle in cycle_status_table
    s8: Exit
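
The nested loop in the Uploader flowchart (cycles, then projects within each cycle) can be sketched as follows. This is a simplified illustration, not the actual code: run_uploader, the fetch and upload callbacks (stand-ins for geofetch and pepdbagent), the dictionary layout, and the status names "processing", "success", and "failure" are all assumptions.

```python
from typing import Callable, Dict, List


def run_uploader(
    queued_cycles: List[Dict],
    fetch: Callable[[str], dict],
    upload: Callable[[str, dict], None],
) -> None:
    """Process every queued cycle, recording a status after each step."""
    for cycle in queued_cycles:                    # s2: take each queued cycle
        cycle["status"] = "processing"
        failed = 0
        for gse in list(cycle["projects"]):        # s3: take each project (GSE)
            cycle["projects"][gse] = "processing"  # s4: update project status
            try:
                project = fetch(gse)               # s5: download the project
                upload(gse, project)               # s6: add the project to the DB
                cycle["projects"][gse] = "success"
            except Exception:
                # One failed project must not stop the rest of the cycle.
                cycle["projects"][gse] = "failure"
                failed += 1
        # s7: the cycle status summarizes its projects
        cycle["status"] = "success" if failed == 0 else "failure"
```

Catching exceptions per project keeps a single bad accession from aborting the whole cycle, and the recorded per-project status is what a later check can use to retry only the failures.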

Checker Flowchart:

graph TD
    A[Choose cycle to check] --> B{Did it run?}
    B -->|Yes| C{Was it successful?}
    B -->|No| D[Run Queuer for the cycle]
    C -->|Yes| E{Did all samples succeed?}
    C -->|No| D

    D --> D1[Run Uploader for the cycle]
    D1 --> K

    E --> |Yes| K[Exit]
    E --> |No| G[Retrieve failed samples]

    G --> H[Run Queuer for samples]
    H --> F[Run Uploader for queued samples]
    
    F --> I[Change samples status in the table]

    I --> J[Change cycle status in the table]

    J --> K[Exit]
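
The decision tree in the Checker flowchart can be expressed as a small planning function. This is a sketch under stated assumptions: plan_check, the action labels it returns, the dictionary layout, and the "success" status value are hypothetical, and a missing cycle (None) is treated here as "did not run".

```python
from typing import Dict, List, Optional


def plan_check(cycle: Optional[Dict]) -> List[str]:
    """Return the ordered actions the checker would take for one cycle."""
    rerun_cycle = ["run_queuer", "run_uploader"]
    if cycle is None:
        # The cycle did not run (or does not exist): create and upload it.
        return rerun_cycle
    if cycle["status"] != "success":
        # The cycle ran but was unsuccessful: rerun it as a whole.
        return rerun_cycle
    failed = [gse for gse, status in cycle["projects"].items() if status != "success"]
    if not failed:
        return []  # everything succeeded, nothing to do
    # Only some samples failed: requeue and upload just those.
    return [f"requeue:{gse}" for gse in failed] + ["run_uploader", "update_statuses"]
```

Returning a plan instead of executing it keeps the branching logic easy to test separately from the queuer and uploader.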


geopephub's People

Contributors

khoroshevskyi, nsheff

Watchers

Neal Magee

geopephub's Issues

Add feature that dumps all geo projects to one zip file

We need to add a new feature that performs a routine database dump. This is doable, but it is not currently set up. Since we automatically index and parse GEO daily as new records are added, you can always come back later and get a newer dump to rerun on a more up-to-date version.

Exit Code 137 - Out of Memory

The GitHub action fails while downloading GSE178610:

Trying GSE178610 (not a file) as accession...
Skipped 0 accessions. Starting now.
Processing accession 1 of 1: 'GSE178610'
/home/runner/work/_temp/4dfe2662-0b6d-4b4d-84ff-aad23b8dfa34.sh: line 1:  1810 Killed                  python geo_pipeline_script.py --namespace geo_recent --host *** --db *** --user *** -***
Error: Process completed with exit code 137.

I am not sure why this error is happening. My guess is that geofetch is creating an annotation dict or list that is too large.
Do you have any ideas on how to handle this error and, if it occurs, how to skip this accession? @nleroy917 @nsheff
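
One possible way to skip such an accession, sketched here as an assumption rather than as what geopephub does, is to run each accession in its own child process and inspect the return code. Exit code 137 is 128 + 9, meaning the process received SIGKILL, which is what the kernel's out-of-memory killer sends. The function name run_isolated and the returned labels are hypothetical, and the sketch assumes a POSIX system.

```python
import signal
import subprocess
import sys
from typing import List


def run_isolated(cmd: List[str]) -> str:
    """Run one accession's command in a child process and classify the result."""
    result = subprocess.run(cmd)
    if result.returncode == 0:
        return "success"
    # Python reports death by signal as a negative code (-9);
    # a shell wrapper reports it as 128 + signal number (137).
    if result.returncode in (-signal.SIGKILL, 128 + signal.SIGKILL):
        return "killed"
    return "failure"


if __name__ == "__main__":
    # Demonstration with a trivial command instead of a real accession download.
    print(run_isolated([sys.executable, "-c", "pass"]))
```

A caller could mark the accession as failed when run_isolated returns "killed" and continue with the next one. This only helps if the child process, not the parent, is the one that gets killed.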

conflict of sqlalchemy versions

pepdbagent now uses the new sqlalchemy (version >=2.0.0), while sqlmodel still uses an old version of sqlalchemy (<=1.4.0). This leads to conflicts, and the uploader does not work.
Two solutions:

  1. Change sqlmodel to sqlalchemy
  2. Use old, modified version of pepdbagent. We already have it:)

Next steps (increment of value)

Tasks before first stable version:

  • Add a SOFT file size checker to geofetch.
  • Add a time limit to the GSE populator.
  • Make the script failure-resistant.

Add upload function of gse by period of time

Currently, there is no way to check or upload GSE projects for a specified start and end date. Add a function that receives a start date and an end date and 1) checks whether this period is already in the queue, 2) adds it to the queue, and 3) adds the projects to PEPhub.
