orion-search / orion
A data collection, enrichment and analysis system used in Orion search.
Home Page: https://www.orion-search.org/
License: MIT License
Add tfidf+SVD as a fast and reliable way of transforming the abstracts to vectors.
Need to reduce the number of fields of study shown in the network visualisation. We will keep those occurring in more than N papers.
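A minimal sketch of the frequency filter described above. The input shape (a mapping from paper ID to its fields of study) and the names `frequent_fields` and `min_papers` are assumptions for illustration, not part of the Orion codebase.

```python
from collections import Counter


def frequent_fields(paper_fields, min_papers):
    """Return the fields of study that occur in more than `min_papers` papers.

    `paper_fields` is a hypothetical mapping: paper ID -> list of field names.
    """
    counts = Counter()
    for fields in paper_fields.values():
        # Count each field at most once per paper.
        counts.update(set(fields))
    return {field for field, n in counts.items() if n > min_papers}


papers = {
    "p1": ["biology", "genomics"],
    "p2": ["biology", "ecology"],
    "p3": ["biology", "genomics"],
}
kept = frequent_fields(papers, min_papers=1)
```

Only the fields in `kept` would then be passed to the network visualisation.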
Get country-level data on GDP, income and other variables. Create a list and discuss them with Zac and Lilia.
Measure the RCA (revealed comparative advantage) for Fields of Study at the country and institution level. Do it on an annual basis (2011-2020).
Create an ORM for it too.
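As a rough sketch, RCA can be computed with the standard Balassa-style ratio: the share of an entity's papers in a field, divided by that field's share of all papers. The input shape (a mapping from `(entity, field)` to a paper count) and the function name `rca` are assumptions for illustration; the issue does not specify the exact formula used.

```python
from collections import defaultdict


def rca(counts):
    """Compute RCA for every (entity, field) pair with a non-zero count.

    `counts` is a hypothetical mapping: (entity, field) -> number of papers,
    where an entity is a country or an institution.
    """
    entity_total = defaultdict(float)
    field_total = defaultdict(float)
    grand_total = 0.0
    for (entity, field), n in counts.items():
        entity_total[entity] += n
        field_total[field] += n
        grand_total += n

    return {
        (entity, field): (n / entity_total[entity]) / (field_total[field] / grand_total)
        for (entity, field), n in counts.items()
        if n > 0
    }


counts_2019 = {
    ("UK", "genomics"): 30,
    ("UK", "ecology"): 10,
    ("FR", "genomics"): 10,
    ("FR", "ecology"): 30,
}
scores = rca(counts_2019)
```

A value above 1 means the entity is relatively specialised in that field. For the annual measure, the same function could be applied to the counts of each year separately.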
Update the database and filter out MAG entities whose DOI is not in the bioRxiv dataset.
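The filter could look roughly like the sketch below. The record shape (dicts with a `doi` key) and the helper names are hypothetical. DOIs are compared after lowercasing because DOI matching is case-insensitive.

```python
def normalise_doi(doi):
    """Lowercase and trim a DOI so that comparisons are case-insensitive."""
    return doi.strip().lower()


def keep_biorxiv_only(mag_papers, biorxiv_dois):
    """Keep only the MAG records whose DOI appears in the bioRxiv dataset.

    `mag_papers` is a hypothetical list of dicts with a "doi" key;
    records without a DOI are dropped.
    """
    allowed = {normalise_doi(d) for d in biorxiv_dois}
    return [
        paper
        for paper in mag_papers
        if paper.get("doi") and normalise_doi(paper["doi"]) in allowed
    ]


mag = [
    {"id": 1, "doi": "10.1101/ABC"},
    {"id": 2, "doi": "10.1000/xyz"},
    {"id": 3, "doi": None},
]
filtered = keep_biorxiv_only(mag, ["10.1101/abc"])
```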
Measure how interdisciplinary a country or institute is based on the Fields of Study.
This will be part of the same task as #28.
Add the ORMs for Microsoft Academic, bioRxiv and the geocoded affiliations.
Over at https://observablehq.com/
It's possible to query MAG with J.JN='bioarxiv' and fetch bioRxiv papers. Test if it's possible to get all of them. This would eliminate various hooks:
There's some dead code all over the place that was caught with lgtm. It's mainly unused imports, so this should be a straightforward improvement.
Update the country collaboration task to reflect changes in the MAG parser. As a bonus, create an annual country collaboration network.
Getting the document (ie abstract) vector of every paper is a core task as it will be used in at least two downstream tasks:
Note: This can be a batch task because the models are pretrained.
dev
Use a dynamically filled table in postgres.
This will slightly increase the number of calls, but it will overcome throttling issues when querying with very broad fields of study such as medicine or artificial intelligence.
ALBERT produces 768D dense vectors. We have to reduce these to 2D in order to visualise them.
Investigate if it would be helpful to add Altmetric data.
As the number of tasks in the orion.py DAG increases, it becomes more difficult for newcomers to understand what's going on.
Write a readme briefly explaining every operator. Add a snapshot of the DAG too.
Instead of storing an array as TEXT, store it as it is.
Use the Gender API for the same reasons stated in my previous work.
I should use the author names from MAG in order to make the tool more generalisable.
@apply_defaults
.env
instead of using misctools.py
Change serialisation method so that it's possible to read the index from S3.
Find the most relevant abstract vectors (discussed in #1) for a given query. The search has to be fast. Some leads:
Collect the hierarchy level, child nodes and parents of each field of study in our bioRxiv data.
The values of the metrics (RCA, interdisciplinarity), as well as the FoS frequency, have to be updated when adding new papers to the DB.
Provide a clear and current description of what every operator does, from where it receives data and where it stores the outputs.
Devise a strategy to filter topics. This task will inform all the metrics we are calculating.
Use the Universal Sentence Encoder model to vectorise abstracts. Try both lite and large versions.
Update bioRxiv data and specify production and development databases.
MagCollectionOperator to get only N papers when running on dev mode.
dev db.
prod db.
Airflow currently fetches data from my local PostgreSQL database. Change the DB ENDPOINT to AWS so that the remote data is accessed/stored.
Create a country collaboration table with the format country_a | country_b | edge_weight.
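One way to build rows in that format is to count, for every pair of countries, the papers on which both appear. The sketch below assumes a hypothetical mapping from paper ID to the countries of its affiliations, and assumes the edge weight is the number of co-authored papers; neither is confirmed by the issue.

```python
from collections import Counter
from itertools import combinations


def country_collaborations(paper_countries):
    """Return (country_a, country_b, edge_weight) rows.

    `paper_countries` is a hypothetical mapping: paper ID -> list of countries.
    Each unordered pair is counted once per paper, with country_a < country_b
    so that (UK, FR) and (FR, UK) end up on the same row.
    """
    edges = Counter()
    for countries in paper_countries.values():
        for a, b in combinations(sorted(set(countries)), 2):
            edges[(a, b)] += 1
    return [(a, b, weight) for (a, b), weight in sorted(edges.items())]


papers = {
    "p1": ["UK", "FR", "UK"],
    "p2": ["FR", "UK"],
    "p3": ["DE"],
    "p4": ["DE", "FR"],
}
rows = country_collaborations(papers)
```

Grouping the papers by publication year before calling the function would give one such table per year.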
Explore the bioRxiv data. Focus on fields from MAG so that the work can be used in other projects too. The output should be a long airflow operator.
From World Development Indicators to Sustainable Development Goals, World Bank is probably the best source to collect up-to-date, country-level statistics.
Add utility functions to store and load objects from S3.
Update data sources in the main readme file and add a link to the medium blog.
Forgot to update a variable name in a logging statement and the parser crashes.
We're now bundling the author affiliations; however, there's evidence suggesting they could be unique for each paper.