orion's People

Contributors

dependabot[bot], kstathou, zacoppotamus

orion's Issues

Metrics: Comparative advantage

Measure the revealed comparative advantage (RCA) of each field of study at the country and institution level, on an annual basis (2011-2020).

Create an ORM for it too.
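
As a rough illustration of the metric, here is a minimal sketch of the standard Balassa RCA index computed from paper counts. The function name, the input layout and the toy numbers are assumptions made for this example, not the repository's implementation. It assumes every entity and field has at least one paper.

```python
from collections import defaultdict


def revealed_comparative_advantage(counts):
    """Compute the Balassa RCA index for every (entity, field) pair.

    counts maps (entity, field_of_study) -> number of papers, where an
    entity is a country or an institution (hypothetical input layout).
    """
    entity_totals = defaultdict(float)
    field_totals = defaultdict(float)
    grand_total = 0.0
    for (entity, field), n in counts.items():
        entity_totals[entity] += n
        field_totals[field] += n
        grand_total += n

    rca = {}
    for (entity, field), n in counts.items():
        # Share of the field within the entity's own output.
        entity_share = n / entity_totals[entity]
        # Share of the field within the output of all entities.
        global_share = field_totals[field] / grand_total
        rca[(entity, field)] = entity_share / global_share
    return rca


# Toy counts for a single year (made-up numbers).
counts_2019 = {
    ("UK", "genetics"): 30,
    ("UK", "neuroscience"): 10,
    ("US", "genetics"): 20,
    ("US", "neuroscience"): 40,
}
rca_2019 = revealed_comparative_advantage(counts_2019)
```

A value above 1 means the entity is more specialised in the field than the average entity. For the annual version, the same function could be called once per year on that year's counts.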

Collect all the bioRxiv papers through MAG

It's possible to query MAG with J.JN='bioarxiv' and fetch bioRxiv papers. Test whether it's possible to get all of them. If so, we could:

  • Stop manually getting the data dump from rxivist.
  • Drop the second ORM.
  • Stop filtering out papers that are not published on bioRxiv.
  • Remove all the bioRxiv tables.

Clean lgtm flags

lgtm flagged dead code in several places. It's mainly unused imports, so this should be a straightforward cleanup.

Country collaboration by year

Update the country collaboration task to reflect changes in the MAG parser. As a bonus, create an annual country collaboration network.
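
One possible shape for the annual network is sketched below: for each year, count how many papers link each pair of countries. The input layout (a year plus the list of affiliation countries per paper) and the function name are assumptions for illustration. The real task would read these from the tables produced by the MAG parser.

```python
from collections import Counter
from itertools import combinations


def annual_country_collaboration(papers):
    """Count co-authorship links between countries, per year.

    papers is an iterable of (year, countries) pairs, where countries lists
    the affiliation countries of one paper (duplicates allowed).
    Returns {year: Counter({(country_a, country_b): n_papers})}.
    """
    networks = {}
    for year, countries in papers:
        # Deduplicate and sort so that each pair has one canonical order.
        unique = sorted(set(countries))
        edges = networks.setdefault(year, Counter())
        for pair in combinations(unique, 2):
            edges[pair] += 1
    return networks


# Made-up example data.
papers = [
    (2019, ["UK", "US", "UK"]),
    (2019, ["US", "UK", "DE"]),
    (2020, ["FR"]),
]
networks = annual_country_collaboration(papers)
```

Each yearly Counter is an edge list with weights, so it can be loaded into a graph library or stored as rows of (year, country_a, country_b, weight).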

Transform text data to vectors

Getting the document (i.e. abstract) vector of every paper is a core task as it will be used in at least two downstream tasks:

  • Semantic search
  • Topic modelling or clustering

I will use a pre-trained model for this (for example BERT or Universal Sentence Encoder).

Note: This can be a batch task because the models are pretrained.
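
The batch idea could look roughly like the sketch below. The `encode` argument is a placeholder for whichever pre-trained model is chosen, and `dummy_encode` is a stand-in used only so the example runs without downloading a model. All names here are hypothetical.

```python
def batched(items, batch_size):
    """Yield consecutive slices of items with at most batch_size elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def vectorise_abstracts(abstracts, encode, batch_size=32):
    """Encode abstracts batch by batch and return one vector per abstract.

    encode is any callable mapping a list of strings to a list of vectors,
    for example a thin wrapper around a pre-trained sentence encoder.
    """
    vectors = []
    for batch in batched(abstracts, batch_size):
        vectors.extend(encode(batch))
    return vectors


def dummy_encode(texts):
    # Stand-in for a real model: [character count, word count].
    return [[float(len(t)), float(len(t.split()))] for t in texts]


abstracts = ["Gene expression in yeast.", "A short abstract.", "Neurons fire."]
vectors = vectorise_abstracts(abstracts, dummy_encode, batch_size=2)
```

Keeping the encoder behind a plain callable means BERT, Universal Sentence Encoder or another model can be swapped in without touching the batching logic.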

Add contribution guidelines

  • Open an issue to discuss
  • Always branch out of dev
  • List the required parts: documentation, operator, package, orm

Change MAG query to two-month periods

This will slightly increase the number of calls, but it will overcome throttling issues when querying very broad fields of study such as medicine or artificial intelligence.
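
A minimal sketch of how the two-month windows could be generated is shown below. `two_month_windows` and `build_expression` are hypothetical helpers, and the expression format (an `F.FN` field-of-study filter combined with a `D` date range) is an assumption about the query syntax that should be checked against the query the collector really sends.

```python
from datetime import date, timedelta


def two_month_windows(start_year, end_year):
    """Return (first_day, last_day) pairs covering each year in two-month blocks."""
    windows = []
    for year in range(start_year, end_year + 1):
        for first_month in range(1, 12, 2):
            first_day = date(year, first_month, 1)
            # The window ends one day before the next window starts.
            if first_month == 11:
                next_start = date(year + 1, 1, 1)
            else:
                next_start = date(year, first_month + 2, 1)
            windows.append((first_day, next_start - timedelta(days=1)))
    return windows


def build_expression(field_of_study, window):
    # Assumed expression format; adjust to the query the collector really uses.
    start, end = window
    return (
        f"And(Composite(F.FN=='{field_of_study}'),"
        f"D=['{start.isoformat()}','{end.isoformat()}'])"
    )


windows = two_month_windows(2019, 2020)
expression = build_expression("medicine", windows[0])
```

Each year yields six windows, so a query that used to be issued once per year would be issued six times, each returning a smaller result set.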

Explain what operators do

As the number of tasks in the orion.py DAG increases, it becomes more difficult for newcomers to understand what's going on.

Write a readme briefly explaining every operator. Add a snapshot of the DAG too.

Airflow configurations

  • Fetch data from AWS
  • Run DAG on AWS
  • Properly set hooks and sensors where appropriate
  • Send email on failure
  • Use XCom to push S3 bucket and file names to the next task.
  • Test DAGs
  • Use Jinja templating and @apply_defaults
  • Consider changes in the operators' setup.
  • Consider fetching variables from .env instead of using misctools.py
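
For the last point, reading settings from a .env file could be sketched as follows, using only the standard library. The helper names and the variable names in the example are made up for illustration. A dedicated package for .env files would work as well.

```python
import os


def parse_env_lines(lines):
    """Parse KEY=VALUE lines in .env style into a dict.

    Blank lines, comment lines starting with '#' and lines without '='
    are skipped; quotes surrounding a value are stripped.
    """
    variables = {}
    for raw_line in lines:
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        variables[key.strip()] = value.strip().strip("'\"")
    return variables


def get_setting(name, env_values, default=None):
    # Real environment variables take precedence over the file.
    return os.environ.get(name, env_values.get(name, default))


# Hypothetical file content; with a real file, pass open(".env") instead.
example_lines = [
    "# database",
    "DB_HOST=localhost",
    "DB_PORT = 5432",
    'MAG_API_KEY="not-a-real-key"',
    "",
]
settings = parse_env_lines(example_lines)
```

Because an open file is also an iterable of lines, the same function works on a real .env file without changes.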

Collect FoS metadata

Collect the hierarchy level, child nodes and parent nodes of each field of study in our bioRxiv data.

Update values when adding papers

The values of the metrics (RCA, interdisciplinarity), as well as the FoS frequency, have to be updated when new papers are added to the DB.

Topic filtering

Devise a strategy to filter topics. This task will inform all the metrics we are calculating.

Document the repository

  • How to use the config file (for environment variables, DAG config)
  • How to register for a Microsoft Academic account
  • Create and connect to a PostgreSQL DB (local and AWS + psql)
  • PostgreSQL DB: changes in config and pg_hba.conf
  • How to get a Google Places API key.
  • How to run the whole DAG.
  • How to install Airflow
  • Travis build
  • Gender API key
  • Add https://lgtm.com/
  • Text2vector: Transformers (maybe link to blog?)

Data update & prod/dev DB

Update bioRxiv data and specify production and development databases.

  • Add prod/dev db URIs in config.
  • Create prod/dev DBs on RDS.
  • Change input URI in DAG.
  • Modify MagCollectionOperator to get only N papers when running in dev mode.
  • Run DAG on dev db.
  • Run DAG on prod db.

Fetch data from AWS

Airflow currently fetches data from my local PostgreSQL database. Change the DB ENDPOINT to AWS so that the remote data is accessed/stored.

EDA bioRxiv

Explore the bioRxiv data. Focus on fields from MAG so that the work can be used in other projects too. The output should be a long airflow operator.

Add s3 utils

Add utility functions to store and load objects from S3.
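
A minimal sketch of such helpers is shown below, assuming objects are serialised with pickle and that the caller passes in a boto3 S3 client. The function names are hypothetical. `put_object` and `get_object` are the standard boto3 client calls.

```python
import pickle


def store_on_s3(s3_client, obj, bucket, key):
    """Pickle obj and upload it to s3://bucket/key."""
    s3_client.put_object(Bucket=bucket, Key=key, Body=pickle.dumps(obj))


def load_from_s3(s3_client, bucket, key):
    """Download s3://bucket/key and unpickle its content."""
    response = s3_client.get_object(Bucket=bucket, Key=key)
    return pickle.loads(response["Body"].read())


# Usage, assuming boto3 is installed and AWS credentials are configured
# (bucket and key names are placeholders):
#
#   import boto3
#   s3 = boto3.client("s3")
#   store_on_s3(s3, {"a": 1}, "my-bucket", "example.pickle")
#   data = load_from_s3(s3, "my-bucket", "example.pickle")
```

Passing the client as an argument keeps the helpers easy to test with a stub. Only unpickle objects from buckets you trust, since unpickling can execute arbitrary code.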

Update readme

Update the data sources in the main readme file and add a link to the Medium blog.

Patch: Mag parser

A variable name in a logging statement was not updated, so the parser crashes.
