The orion from orion-search

Fast document vectorisation method

Add TF-IDF + SVD as a fast and reliable way of transforming the abstracts into vectors.

Reduce number of nodes in the hierarchy

Need to reduce the number of fields of study shown in the network visualisation. We will keep those occurring in more than N papers.
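The frequency threshold described above could be sketched as follows. This is a minimal, dependency-free illustration: the function name `frequent_fields`, the `paper_fields` mapping and the toy data are all hypothetical and not taken from the Orion codebase.

```python
from collections import Counter


def frequent_fields(paper_fields, min_papers):
    """Return the fields of study that occur in more than `min_papers` papers."""
    counts = Counter()
    for fields in paper_fields.values():
        # Count each field once per paper, even if it is listed twice.
        counts.update(set(fields))
    return {field for field, n in counts.items() if n > min_papers}


# Hypothetical toy data: paper id -> fields of study.
paper_fields = {
    "p1": ["biology", "genetics"],
    "p2": ["biology", "neuroscience"],
    "p3": ["biology", "genetics"],
    "p4": ["ecology"],
}
kept = frequent_fields(paper_fields, min_papers=1)
```

With N = 1, only the fields attached to at least two papers remain, so the network would show fewer nodes.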

External data: Country-level statistics

Get country-level data on GDP, income and other variables. Create a list and discuss them with Zac and Lilia.

Metrics: Comparative advantage

Measure the revealed comparative advantage (RCA) for Fields of Study at the country and institution level, on an annual basis (2011-2020).

Create an ORM for it too.
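As a rough illustration of the metric, the sketch below uses the standard RCA definition: an entity's share of papers in a field, divided by that field's share of all papers. The input layout (a dict keyed by `(entity, field)`), the function name and the toy counts are assumptions made for this example. It is not the ORM or the pipeline code.

```python
from collections import defaultdict


def revealed_comparative_advantage(counts):
    """Compute RCA from {(entity, field): paper_count}.

    Assumes every entity has at least one paper, so no share has a zero denominator.
    """
    entity_totals = defaultdict(float)
    field_totals = defaultdict(float)
    grand_total = 0.0
    for (entity, field), n in counts.items():
        entity_totals[entity] += n
        field_totals[field] += n
        grand_total += n

    rca = {}
    for (entity, field), n in counts.items():
        entity_share = n / entity_totals[entity]           # field's weight within the entity
        global_share = field_totals[field] / grand_total   # field's weight overall
        rca[(entity, field)] = entity_share / global_share
    return rca


# Hypothetical paper counts for a single year.
counts_2020 = {
    ("GB", "genetics"): 30, ("GB", "ecology"): 10,
    ("US", "genetics"): 30, ("US", "ecology"): 30,
}
rca_2020 = revealed_comparative_advantage(counts_2020)
```

A value above 1 means the entity is relatively specialised in that field. For the annual version, the same function could be called once per year, with entities being either countries or institutions.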

Update data and filter irrelevant MAG entities

Update the database and filter MAG entities if their DOI is not in the bioRxiv dataset.
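A minimal sketch of the DOI filter, assuming MAG records are available as dicts with a `doi` key and the bioRxiv DOIs as a list of strings. The record layout, the function names and the DOIs below are made up for illustration.

```python
def normalise_doi(doi):
    """Lower-case a DOI and strip a resolver prefix so that variants compare equal."""
    doi = doi.strip().lower()
    for prefix in ("https://doi.org/", "http://doi.org/", "doi:"):
        if doi.startswith(prefix):
            doi = doi[len(prefix):]
    return doi


def filter_by_biorxiv_doi(mag_papers, biorxiv_dois):
    """Keep only the MAG papers whose DOI appears in the bioRxiv dataset."""
    known = {normalise_doi(d) for d in biorxiv_dois}
    return [
        paper for paper in mag_papers
        if paper.get("doi") and normalise_doi(paper["doi"]) in known
    ]


# Made-up records and DOIs, for illustration only.
mag_papers = [
    {"id": 1, "doi": "10.1101/000001"},
    {"id": 2, "doi": "https://doi.org/10.1101/000002"},
    {"id": 3, "doi": "10.9999/unrelated"},
    {"id": 4, "doi": None},
]
biorxiv_dois = ["10.1101/000001", "10.1101/000002"]
relevant = filter_by_biorxiv_doi(mag_papers, biorxiv_dois)
```

Normalising both sides before the comparison avoids dropping papers only because of case or URL-prefix differences.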

Add MAG data collection pipeline

Metrics: Research interdisciplinarity

Measure how interdisciplinary a country or institute is based on the Fields of Study.

This will be part of the same task as #28.
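The issue does not say which measure to use. One common option, shown here purely as an assumption, is the Shannon diversity of an entity's Fields of Study distribution. The function name and the toy counts are hypothetical.

```python
import math


def shannon_diversity(field_counts):
    """Shannon entropy (natural log) of a {field: paper_count} distribution."""
    total = sum(field_counts.values())
    if total == 0:
        return 0.0
    entropy = 0.0
    for n in field_counts.values():
        if n > 0:
            p = n / total
            entropy -= p * math.log(p)
    return entropy


# Hypothetical counts of papers per field for two countries.
focused = shannon_diversity({"genetics": 50})
balanced = shannon_diversity({"genetics": 25, "ecology": 25})
```

Under this measure, an entity publishing in a single field scores 0, and the score grows as papers spread more evenly over more fields.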

Add ORMs for the database

Add the ORMs for Microsoft Academic, bioRxiv and the geocoded affiliations.

Split affiliations to industry and academia

Metrics: Gender diversity

Create a table as in #52 and #49 for gender diversity.

Create a notebook that interacts with our API

Over at https://observablehq.com/

Collect all the bioRxiv papers through MAG

It's possible to query MAG with J.JN='bioarxiv' and fetch bioRxiv papers. Test whether it's possible to get all of them. This would eliminate various hooks:

Manually get the data dump from rxivist.
Have a second ORM.
Filter papers that are not published on bioRxiv.
Remove all the bioRxiv tables.

Add ISO country codes and continents

Clean lgtm flags

There's some dead code all over the place that was caught by lgtm. It's mainly unused imports, so this should be a straightforward improvement.

Add geolocation pipeline

Country collaboration by year

Update the country collaboration task to reflect changes in the MAG parser. As a bonus, create an annual country collaboration network.
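The annual network could be built roughly as below. This sketch assumes each paper is reduced to a `(year, country codes)` pair. That layout, the function name and the toy data are assumptions, not the format produced by the MAG parser.

```python
from collections import Counter, defaultdict
from itertools import combinations


def annual_country_collaboration(papers):
    """Count co-authorship links between countries, separately for each year.

    papers: iterable of (year, [country codes of the paper's affiliations]).
    Returns {year: Counter({(country_a, country_b): number_of_papers})}.
    """
    networks = defaultdict(Counter)
    for year, countries in papers:
        # Deduplicate and sort so that each undirected pair has one canonical key.
        for pair in combinations(sorted(set(countries)), 2):
            networks[year][pair] += 1
    return networks


# Hypothetical toy data.
papers = [
    (2019, ["GB", "US", "GB"]),
    (2019, ["US", "GB", "DE"]),
    (2020, ["FR"]),
]
networks = annual_country_collaboration(papers)
```

Each yearly `Counter` is a weighted edge list. Papers with a single country contribute no edges.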

Transform text data to vectors

Getting the document (i.e. abstract) vector of every paper is a core task, as it will be used in at least two downstream tasks:

Semantic search
Topic modelling or clustering

I will use a pre-trained model for this (for example BERT or Universal Sentence Encoder).

Note: This can be a batch task because the models are pre-trained.

Add derivative tables in DAG

Add contribution guidelines

Open an issue to discuss
Always branch out of dev
List the required parts: documentation, operator, package, orm

Change target of topic filtering outcome and measure citation sum and paper count

Use a dynamically filled table in postgres.

Change MAG query to two-month periods

This will slightly increase the number of calls, but it will overcome throttling issues when querying very broad fields of study such as medicine or artificial intelligence.
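A sketch of how the two-month date windows could be generated. The function name is hypothetical, and how the dates are embedded in the MAG query expression is deliberately left out.

```python
from datetime import date, timedelta


def two_month_windows(start_year, end_year):
    """Yield (first_day, last_day) pairs covering each year in two-month periods."""
    for year in range(start_year, end_year + 1):
        for month in range(1, 13, 2):
            first_day = date(year, month, 1)
            # First day of the following window; Nov-Dec rolls over into the next year.
            if month == 11:
                next_start = date(year + 1, 1, 1)
            else:
                next_start = date(year, month + 2, 1)
            yield first_day, next_start - timedelta(days=1)


windows = list(two_month_windows(2019, 2020))
```

Each year yields six windows, so a query that was previously issued once per year would be issued six times, each with a narrower date range.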

Dimensionality reduction of the document vectors

ALBERT produces 768-dimensional dense vectors. We have to reduce these to 2D in order to visualise them.

Metrics: Investigate use of Altmetrics

Investigate if it would be helpful to add Altmetric data.

Explain what operators do

As the number of tasks in the orion.py DAG increases, it becomes more difficult for newcomers to understand what's going on.

Write a readme briefly explaining every operator. Add a snapshot of the DAG too.

Reorganise operators in directories

Change `doc_vectors` type to ARRAY

Instead of storing an array as TEXT, store it as it is.

Find authors' gender

Use the Gender API for the same reasons stated in my previous work.

I should use the author names from MAG in order to make the tool more generalisable.

Airflow configurations

Fetch data from AWS
Run DAG on AWS
Properly set hooks and sensors where appropriate
Send email on failure
Use Xcom to push S3 bucket and file names to the next task.
Test DAGs
jinja format + @apply_defaults
Consider changes in the operators' setup.
Consider fetching variables from .env instead of using misctools.py
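For the last item in the list above, reading settings from a .env file could look roughly like this. It is a dependency-free sketch: the helper names and the variable names are hypothetical, and a dedicated library could be used instead.

```python
import os


def load_env_file(lines):
    """Parse KEY=VALUE lines (the usual .env layout) into a dict."""
    values = {}
    for line in lines:
        line = line.strip()
        # Skip blanks, comments and anything that is not an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def get_setting(name, env_values, default=None):
    """Prefer a real environment variable, then the .env value, then the default."""
    return os.environ.get(name, env_values.get(name, default))


# Hypothetical .env content; in practice this would come from open(".env").
example_lines = [
    "# storage",
    "ORION_EXAMPLE_BUCKET=my-bucket",
    'ORION_EXAMPLE_PREFIX="vectors/"',
    "",
]
env_values = load_env_file(example_lines)
bucket = get_setting("ORION_EXAMPLE_BUCKET", env_values)
```

Letting real environment variables take precedence means a deployed DAG can override the file without editing it.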

Faiss index serialisation

Change serialisation method so that it's possible to read the index from S3.

Pass topic list before measuring RCA

Retrieve relevant documents

Find the most relevant abstract vectors (discussed in #1) for a given query. The search has to be fast. Some leads:

Collect FoS metadata

Collect the hierarchy level, child nodes and parents of each field of study in our bioRxiv dataset.

Update values when adding papers

The values of the metrics (RCA, interdisciplinarity), as well as the FoS frequency, have to be updated when adding new papers to the DB.

Update docstrings in all operators

Provide a clear and current description of what every operator does, from where it receives data and where it stores the outputs.

Complete the blog post on Orion's backend and add it to the main README

Add DB schema

Cluster document vectors on country level

Topic filtering

Devise a strategy to filter topics. This task will inform all the metrics we are calculating.

Document the repository

Document vectorisation with USE

Use the Universal Sentence Encoder model to vectorise abstracts. Try both lite and large versions.

Data update & prod/dev DB

Update bioRxiv data and specify production and development databases.

Add prod/dev db URIs in config.
Create prod/dev DBs on RDS.
Change input URI in DAG.
Modify MagCollectionOperator to get only N papers when running on dev mode.
Run DAG on dev db.
Run DAG on prod db.
