aslandevbrat / authornamedisambiguation

This project forked from brianpulfer/authornamedisambiguation

An AND implementation for biomedical articles from PubMed database

License: MIT License

Python 96.81% HTML 3.19%

AND - Author Name Disambiguation

Project

This project trains a classifier that can tell whether two biomedical articles were written by the same author. To do so, a set of features is extracted from the two articles, such as e-mail address, affiliation name, city, country and date. All articles are taken from PubMed.
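As an illustration of what pairwise feature extraction could look like, here is a minimal sketch. The `Article` fields, the `pair_features` function and the choice of features are assumptions made for this example; they are not taken from the project's code.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Article:
    """Hypothetical record holding the fields mentioned above."""
    pmid: str
    email: Optional[str]
    affiliation: Optional[str]
    city: Optional[str]
    country: Optional[str]
    year: Optional[int]


def _same(a: Optional[str], b: Optional[str]) -> float:
    """Return 1.0 if both values are present and equal (ignoring case and
    surrounding whitespace), 0.0 otherwise."""
    if not a or not b:
        return 0.0
    return 1.0 if a.strip().lower() == b.strip().lower() else 0.0


def pair_features(a: Article, b: Article) -> List[float]:
    """Build one feature vector describing a pair of articles."""
    if a.year is not None and b.year is not None:
        year_gap = float(abs(a.year - b.year))
    else:
        year_gap = -1.0  # assumed sentinel for a missing publication year
    return [
        _same(a.email, b.email),
        _same(a.affiliation, b.affiliation),
        _same(a.city, b.city),
        _same(a.country, b.country),
        year_gap,
    ]
```

Each pair of articles becomes one numeric vector, which is the form most classifiers expect as input.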

Several classifiers are trained with different algorithms, such as Random Forest and K-NN. A neural network is also implemented.
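To make the K-NN idea concrete, the following pure-Python sketch classifies a feature vector by majority vote among its nearest training examples. It only illustrates the technique; the function name `knn_predict`, the Euclidean distance and the default `k=3` are assumptions and do not describe how the project trains its models.

```python
import math
from collections import Counter
from typing import List, Sequence


def knn_predict(
    train_x: Sequence[Sequence[float]],
    train_y: Sequence[str],
    query: Sequence[float],
    k: int = 3,
) -> str:
    """Return the majority label among the k training vectors closest to query."""
    # Pair every training vector's Euclidean distance to the query with its label.
    scored: List[tuple] = [
        (math.dist(x, query), label) for x, label in zip(train_x, train_y)
    ]
    # Sort by distance only, so labels never take part in the comparison.
    scored.sort(key=lambda item: item[0])
    votes = Counter(label for _, label in scored[:k])
    return votes.most_common(1)[0][0]
```

With labels such as "YES" and "NO" for each training pair, a call like `knn_predict(train_x, train_y, features)` returns the predicted label for a new pair.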

Environment requirements

Python 3 and 20 GB of free disk space

Set-up

Step 1:

Download the Text Categorization API from the following link: https://lexsrv2.nlm.nih.gov/LexSysGroup/Projects/tc/2011/release/tc2011.tgz

Step 2:

Extract the .tgz file. This should generate a 'tc2011' folder. Move the tc2011/data folder into the repository under '/main/'. For the related tests to work, also copy the folder to 'test/retrievers_test/jnius/'.
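Step 2 can also be scripted. The sketch below assumes the archive has already been downloaded and that the resulting folders should be named 'main/data' and 'test/retrievers_test/jnius/data'; the function name `install_tc_data` and the temporary 'tc_extracted' folder are made up for this example.

```python
import shutil
import tarfile
from pathlib import Path
from typing import List


def install_tc_data(archive: Path, repo_root: Path) -> List[Path]:
    """Extract the archive and copy tc2011/data into both target folders."""
    # Assumed scratch location next to the archive.
    extract_dir = archive.parent / "tc_extracted"
    with tarfile.open(archive, "r:gz") as tar:
        tar.extractall(extract_dir)

    source = extract_dir / "tc2011" / "data"
    if not source.is_dir():
        raise FileNotFoundError(f"expected folder not found: {source}")

    # Assumed final locations, derived from the instructions above.
    targets = [
        repo_root / "main" / "data",
        repo_root / "test" / "retrievers_test" / "jnius" / "data",
    ]
    for target in targets:
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copytree(source, target, dirs_exist_ok=True)
    return targets
```

A call such as `install_tc_data(Path("tc2011.tgz"), Path("."))`, run from the repository root, would place both copies. The sketch copies rather than moves the folder, so the extracted original stays available.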

Step 3:

Open the project in an IDE of your preference (PyCharm is suggested) and create a new virtual environment using a Python 3 interpreter. Install all the requirements specified in the 'requirements.txt' file.

Step 4:

Install pyjnius according to your operating system. Instructions can be found at https://pyjnius.readthedocs.io/en/stable/index.html

Step 5:

If you are using macOS, you may need to install certificates for the OGER source files to work. To do so, go to Macintosh HD > Applications > Python 3.6 folder (or the folder for whichever Python version you are using) and double-click the "Install Certificates.command" file.

Dataset

These are the files that make up the AND corpus: (1) 1500_pairs_train.csv (2) 400_pairs_test.csv

These files contain randomly selected pairs of MEDLINE publications sharing an author with the same last name and first initial.

Each file has the following headers:

PMID1/2 - PubMed ID of the first/second publication in a pair.

Last_name1/2 - Author last names.

Initials1/2 - Author initials.

First_name1/2 - Author first names.

Authorship - YES means that the authors are the same person and NO otherwise.
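The headers described above suggest how the pairs could be loaded. This is a minimal sketch that assumes comma-separated files and that "PMID1/2" expands to the literal column names `PMID1` and `PMID2`; the function name `read_pairs` and the output keys are hypothetical.

```python
import csv
from typing import Dict, Iterable, Iterator


def read_pairs(lines: Iterable[str]) -> Iterator[Dict[str, object]]:
    """Yield one dictionary per publication pair with a boolean label."""
    for row in csv.DictReader(lines):
        yield {
            "pmid1": row["PMID1"],  # assumed exact column name
            "pmid2": row["PMID2"],  # assumed exact column name
            # YES means both publications share the same author.
            "same_author": row["Authorship"].strip().upper() == "YES",
        }
```

For example, `with open("1500_pairs_train.csv", newline="") as f: pairs = list(read_pairs(f))` would read the training file into memory.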

You should cite this data with the following publication:

Dina Vishnyakova, Raul Rodriguez-Esteban, Fabio Rinaldi, A new approach and gold standard toward author disambiguation in MEDLINE, Journal of the American Medical Informatics Association, ocz028, https://doi.org/10.1093/jamia/ocz028

https://academic.oup.com/jamia/advance-article-abstract/doi/10.1093/jamia/ocz028/5432091
