thelastquestion / nihopioiddata

NIH opioids research data science medium post

Home Page: https://medium.com/@bryant.d.renaud/opioid-crisis-whats-the-government-doing-about-it-42f5f50fe6e4

License: MIT License

Jupyter Notebook 100.00%

nihopioiddata's Introduction

Opioid crisis -- what's the government doing about it?

Analysis performed in connection with a data science Medium post

Not an official position of the US Govt.

Table of Contents

  1. Installation
  2. Project Motivation
  3. File Descriptions
  4. Results
  5. Licensing, Authors, and Acknowledgements
  6. A note on Text Analysis

Installation

No libraries beyond the Anaconda distribution of Python should be needed to run the code here. The code should run with no issues using Python versions 3.*. For reference, the libraries used are:

  • downloading data
    • urllib
    • zipfile
    • io
  • wrangling data
    • pandas
    • numpy
  • cleaning data
    • re
  • modeling
    • nltk
    • sklearn
    • statsmodels
  • visualizing
    • matplotlib.pyplot

Project Motivation

I am currently making my way through an intensive data science certification course. I developed this code and the associated blog post as my first large assignment in the course. It is, of course, also inspired by current events.

File Descriptions

You will need to run the downloadData notebook in order to pull the open data from Federal Reporter and assemble it into one csv.
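As a rough illustration of that download-and-assemble step, here is a minimal sketch using only the standard library. It is not the notebook's actual code: the function names are hypothetical, the download URL is left as a parameter because the real Federal Reporter locations are not given here, and it assumes each zip archive contains UTF-8 CSV files that share the same columns.

```python
import csv
import io
import urllib.request
import zipfile


def fetch_zip(url):
    """Download a zip archive and return its raw bytes (url supplied by the caller)."""
    with urllib.request.urlopen(url) as response:
        return response.read()


def rows_from_zip(zip_bytes):
    """Yield each row (as a dict) from every .csv file inside a zip archive."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for name in archive.namelist():
            if not name.lower().endswith(".csv"):
                continue
            with archive.open(name) as raw:
                # Assumption: the files are UTF-8 encoded.
                text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
                yield from csv.DictReader(text)


def combine_archives(archives, out_file):
    """Write the rows of all archives to one CSV; return the number of rows written.

    The header is taken from the first row seen (assumes shared columns).
    """
    writer = None
    count = 0
    for zip_bytes in archives:
        for row in rows_from_zip(zip_bytes):
            if writer is None:
                writer = csv.DictWriter(
                    out_file, fieldnames=list(row), extrasaction="ignore"
                )
                writer.writeheader()
            writer.writerow(row)
            count += 1
    return count
```

In use, one would call `fetch_zip` once per yearly archive, pass the resulting list of byte strings to `combine_archives`, and give it a file opened with `open(path, "w", newline="")`.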

The opioidResearchData notebook will then read in that csv and perform all the analysis and plot generation. The other files in this repo include a Washington Post image for the Medium post and the plots generated from this notebook.

Results

Please read the Medium post for details, but a summary of findings is:

  1. Research lags current events: despite the fact that the opioid crisis, in terms of mortality, was well underway in the late ’00s, NIH’s opioid-related research portfolio did not see a spike in funding until much later, starting around 2016. This spike took the form of an increase in both the number of opioids projects awarded and the dollars awarded for opioids projects.
  2. Using NLP and topic modeling, we observed that studies involving particular patient subgroups are rising in importance. Taken as a whole, however, the plurality of opioids projects over the last 11 fiscal years have tended to involve conditions opioids are prescribed for, HIV and other infections, and dependence and recovery.
  3. While total project costs tend not to vary much by topic area, those projects associated with the road to recovery do seem to be funded at a level around $100,000 more than projects in most other topics. Organizations in certain states tend to have larger project costs, but this may be an artifact of those organizations receiving relatively few opioid project grants.

Licensing, Authors, and Acknowledgements

I've included the MIT license here to be as permissive as makes sense. If you find anything here useful, go for it! Of course, if you use the WaPo picture, make sure to attribute properly, as it is not mine to circulate freely.

A note on Text Analysis (Topic Modeling)

LDA (Latent Dirichlet Allocation) is a model used for discovering abstract topics from a collection of documents. These 'latent' topics can be discovered based on observed data -- words in the documents, in this case.

To surface these topics, we create a matrix where each document is a row and each column is a word in the corpus vocabulary. The corpus vocabulary is the universe of words present in any one or more documents in the corpus (minus chosen 'stopwords' that we consider to provide little information).
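The matrix described above can be sketched in a few lines of plain Python. This is an illustration only, not the notebook's code: the stopword list is a tiny made-up sample, the tokenizer is deliberately simple, and the function names are hypothetical.

```python
import re
from collections import Counter

# Illustrative stopword list only; a real analysis would use a fuller one.
STOPWORDS = {"the", "of", "and", "a", "in", "for", "to", "is"}


def tokenize(text):
    """Lowercase the text, keep alphabetic tokens, and drop stopwords."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]


def document_term_matrix(documents):
    """Return (vocabulary, matrix) where matrix[i][j] counts word j in document i."""
    counts = [Counter(tokenize(doc)) for doc in documents]
    # The vocabulary is every non-stopword that appears in at least one document.
    vocabulary = sorted(set().union(*counts))
    matrix = [[c[word] for word in vocabulary] for c in counts]
    return vocabulary, matrix


docs = [
    "Opioid use and chronic pain",
    "Treatment of opioid dependence",
    "Chronic pain treatment",
]
vocabulary, matrix = document_term_matrix(docs)
# vocabulary → ['chronic', 'dependence', 'opioid', 'pain', 'treatment', 'use']
# matrix[0]  → [1, 0, 1, 1, 0, 1]
```

Each row is one document and each column is one vocabulary word, so the matrix has as many rows as documents and as many columns as distinct non-stopwords.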

Each cell of the matrix is a count of that word in that document. A more sophisticated variant replaces these raw counts with a weighted version known as TF-IDF. TF stands for term frequency, and TF-IDF is term frequency times inverse document frequency. In other words, we look not only at how often a word appears in a given document, but also at how distinctive that word is across the whole collection of documents (the corpus). For example, words like "often" or "use" are encountered frequently, but they are less informative (more semantically vacuous) when we want to discern the topic of a particular document, because they may appear frequently across all documents in the corpus. On the other hand, words that appear in only a few documents are likely to be specific to those documents and, therefore, constitute a basis for a topic.
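To make the weighting concrete, here is a small sketch of one textbook form of TF-IDF, the raw count times log(N / df), where N is the number of documents and df is the number of documents containing the word. Library implementations often add smoothing or normalization, so their numbers will differ from this sketch. The function name and the sample tokens are made up for illustration.

```python
import math
from collections import Counter


def tf_idf(tokenized_docs):
    """Return one {word: weight} dict per document, using tf * log(N / df)."""
    n_docs = len(tokenized_docs)
    # Document frequency: in how many documents does each word appear?
    doc_freq = Counter(word for doc in tokenized_docs for word in set(doc))
    weights = []
    for doc in tokenized_docs:
        term_freq = Counter(doc)
        weights.append({
            word: count * math.log(n_docs / doc_freq[word])
            for word, count in term_freq.items()
        })
    return weights


docs = [
    ["opioid", "use", "pain"],
    ["opioid", "use", "recovery"],
    ["opioid", "use", "hiv", "hiv"],
]
weights = tf_idf(docs)
```

In this toy corpus "use" and "opioid" appear in every document, so their weight is log(3 / 3) = 0, while "hiv" appears twice in a single document and receives 2 * log(3), the largest weight.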

We provide the model with this matrix and the number of topics we want it to use. Think of the number of topics as an analog of k in k-means clustering. The model then iterates a specified number of times over two distributions: 1) which words in the vocabulary are more or less probable to belong to a given topic and 2) which topics are more or less probable for a given document.

The main assumptions, if this all went over your head:

  • each document consists of a mixture of topics, and
  • each topic consists of a collection of words.
