
harmonydata / harmony


The Harmony Python library: a research tool for psychologists to harmonise data and questionnaire items. Open source.

Home Page: https://harmonydata.ac.uk

License: MIT License

Python 55.90% Jupyter Notebook 44.10%
anxiety data-harmonisation data-harmonization data-science depression harmonisation harmonization harmony mental-health-catalogue natural-language-processing

harmony's Introduction


Harmony Python library


You can also join our Discord server! If you found Harmony helpful, you can leave us a review!

What does Harmony do?

  • Psychologists and social scientists often have to match items in different questionnaires, such as "I often feel anxious" and "Feeling nervous, anxious or afraid".
  • This is called harmonisation.
  • Harmonisation is a time-consuming and subjective process.
  • Going through long PDFs of questionnaires and putting the questions into Excel is no fun.
  • Enter Harmony, a tool that uses natural language processing and generative AI models to help researchers harmonise questionnaire items, even in different languages.

Quick start with the code

Read our guide to contributing to Harmony here or read CONTRIBUTING.md.

You can run the walkthrough Python notebook in Google Colab with a single click: Open In Colab

You can also download an R markdown notebook to run in R Studio: Open In R Studio

You can run the walkthrough R notebook in Google Colab with a single click: Open In Colab

The Harmony Project

Harmony is a tool using AI which allows you to compare items from questionnaires and identify similar content. You can try Harmony at https://harmonydata.ac.uk/app and you can read our blog at https://harmonydata.ac.uk/blog/.

Who to contact?

You can contact the Harmony team at https://harmonydata.ac.uk/, or Thomas Wood at https://fastdatascience.com/.

🖥 Installation instructions (video)

Installing Harmony

🖱 Looking to try Harmony in the browser?

Visit: https://harmonydata.ac.uk/app/

You can also visit our blog at https://harmonydata.ac.uk/blog/

✅ You need Tika if you want to extract instruments from PDFs

Download and install Java if you don't have it already. Download and install Apache Tika and run it on your computer https://tika.apache.org/download.html

java -jar tika-server-standard-2.3.0.jar
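Before extracting a PDF, it can help to confirm that the Tika server is actually up. The sketch below only checks that something accepts TCP connections on a port. It assumes Tika is listening on its default port, 9998, on the local machine; `is_port_open`, `TIKA_HOST` and `TIKA_PORT` are illustrative names, not part of Harmony.

```python
import socket

# Assumption: the Tika server runs locally on its default port, 9998.
TIKA_HOST = "localhost"
TIKA_PORT = 9998


def is_port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or host unreachable.
        return False


if __name__ == "__main__":
    if is_port_open(TIKA_HOST, TIKA_PORT):
        print("Something is listening on the Tika port.")
    else:
        print("Nothing is listening; start the Tika server first.")
```

An open port does not prove that the listener is Tika, so treat a positive result as a hint rather than a guarantee.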

Requirements

You need a Windows, Linux or Mac system with

  • Python 3.8 or above
  • the requirements in requirements.txt
  • Java (if you want to extract items from PDFs)
  • Apache Tika (if you want to extract items from PDFs)

🖥 Installing Harmony Python package

You can install from PyPI.

pip install harmonydata

Loading all models

Harmony uses spaCy to help with text extraction from PDFs. spaCy models can be downloaded with the following command in Python:

import harmony
harmony.download_models()

Matching example instruments

instruments = harmony.example_instruments["CES_D English"], harmony.example_instruments["GAD-7 Portuguese"]
questions, similarity, query_similarity, new_vectors_dict = harmony.match_instruments(instruments)

How to load a PDF, Excel or Word into an instrument

harmony.load_instruments_from_local_file("gad-7.pdf")

Optional environment variables

As an alternative to downloading models, you can set environment variables so that Harmony calls spaCy on a remote server. This is only necessary if you are making a server deployment of Harmony.

  • HARMONY_SPACY_PATH - determines where model files are stored. Defaults to HOME DIRECTORY/harmony
  • HARMONY_DATA_PATH - determines where data files are stored. Defaults to HOME DIRECTORY/harmony
  • HARMONY_NO_PARSING - set to 1 to import a lightweight variant of Harmony which doesn't support PDF parsing.
  • HARMONY_NO_MATCHING - set to 1 to import a lightweight variant of Harmony which doesn't support matching.
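The variables above can also be set from Python rather than from the shell. The following is a minimal sketch that assumes they need to be in place before `import harmony` runs; `configure_harmony_env` and `effective_data_path` are hypothetical helpers, and the fallback in `effective_data_path` simply mirrors the documented default rather than Harmony's actual lookup code.

```python
import os
from pathlib import Path


def configure_harmony_env(data_dir=None, no_parsing=False, no_matching=False):
    """Set Harmony's optional environment variables for the current process."""
    if data_dir is not None:
        os.environ["HARMONY_DATA_PATH"] = str(data_dir)
    if no_parsing:
        os.environ["HARMONY_NO_PARSING"] = "1"
    if no_matching:
        os.environ["HARMONY_NO_MATCHING"] = "1"


def effective_data_path():
    """Mirror the documented default: HOME DIRECTORY/harmony unless overridden."""
    return Path(os.environ.get("HARMONY_DATA_PATH", Path.home() / "harmony"))


# Hypothetical directory; set the variables first, then import Harmony.
configure_harmony_env(data_dir="/tmp/harmony_data", no_parsing=True)
print(effective_data_path())
# import harmony  # assumption: the variables are read at import time
```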

Loading instruments from PDFs

If you have a local file, you can load it into a list of Instrument instances:

from harmony import load_instruments_from_local_file
instruments = load_instruments_from_local_file("gad-7.pdf")

Matching instruments

Once you have some instruments, you can match them with each other with a call to match_instruments.

from harmony import match_instruments
all_questions, similarity, query_similarity, new_vectors_dict = match_instruments(instruments)
  • all_questions is a list of the questions passed to Harmony, in order.
  • similarity is the similarity matrix returned by Harmony.
  • query_similarity is the degree of similarity of each item to an optional query passed as argument to match_instruments.
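
As a rough illustration of how these return values might be read together, the sketch below ranks question pairs by score. It assumes the similarity matrix is square, with row i and column j both indexed in the same order as all_questions. Plain strings and made-up scores stand in for real Harmony output, and `top_pairs` is a hypothetical helper.

```python
def top_pairs(question_texts, similarity, k=3):
    """Return the k highest-scoring (score, text_i, text_j) pairs with i < j."""
    pairs = []
    n = len(question_texts)
    for i in range(n):
        for j in range(i + 1, n):  # upper triangle only: skip self and mirror pairs
            pairs.append((similarity[i][j], question_texts[i], question_texts[j]))
    pairs.sort(key=lambda p: p[0], reverse=True)
    return pairs[:k]


# Toy stand-ins, not real Harmony scores.
toy_questions = [
    "I often feel anxious",
    "Feeling nervous, anxious or afraid",
    "I sleep well",
]
toy_similarity = [
    [1.0, 0.8, 0.1],
    [0.8, 1.0, 0.2],
    [0.1, 0.2, 1.0],
]

for score, a, b in top_pairs(toy_questions, toy_similarity, k=2):
    print(f"{score:.2f}  {a}  <->  {b}")
```

Because the function only uses `similarity[i][j]` indexing, it should work equally with nested lists and with a 2D NumPy array.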

⇗⇗ Using a different vectorisation function

Harmony defaults to sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2. However, you can use other sentence transformers from HuggingFace by setting the environment variable HARMONY_SENTENCE_TRANSFORMER_PATH before importing Harmony:

export HARMONY_SENTENCE_TRANSFORMER_PATH=sentence-transformers/distiluse-base-multilingual-cased-v2

Using OpenAI or other LLMs for vectorisation

Any word vector representation can be used by Harmony. The example below works for OpenAI's text-embedding-ada-002 model as of July 2023, provided you have created a paid OpenAI account. However, since LLMs are progressing rapidly, we have chosen not to integrate Harmony directly with the OpenAI client libraries, but instead allow you to pass Harmony any vectorisation function of your choice.

import numpy as np
from harmony import match_instruments_with_function, example_instruments
from openai import OpenAI

client = OpenAI()  # reads the API key from the OPENAI_API_KEY environment variable
model_name = "text-embedding-ada-002"

def convert_texts_to_vector(texts):
    # Request one embedding per input text and stack them into a 2D array.
    response = client.embeddings.create(input=texts, model=model_name)
    return np.asarray([item.embedding for item in response.data])

instruments = example_instruments["CES_D English"], example_instruments["GAD-7 Portuguese"]
all_questions, similarity, query_similarity, new_vectors_dict = match_instruments_with_function(instruments, None, convert_texts_to_vector)
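
To try the plumbing without an API key, an offline stand-in with the same contract as convert_texts_to_vector (a list of texts in, one fixed-length vector per text out) may be useful. The sketch below uses feature hashing over word tokens. It is not semantically meaningful and is not something Harmony ships; `toy_texts_to_vectors` and the dimension of 64 are arbitrary choices for illustration.

```python
import hashlib
import math
import re


def toy_texts_to_vectors(texts, dim=64):
    """Map each text to a unit-length bag-of-words vector via feature hashing."""
    vectors = []
    for text in texts:
        vec = [0.0] * dim
        for token in re.findall(r"\w+", text.lower()):
            # Hash each token to a stable bucket index in [0, dim).
            digest = hashlib.sha256(token.encode("utf-8")).digest()
            vec[int.from_bytes(digest[:4], "big") % dim] += 1.0
        norm = math.sqrt(sum(x * x for x in vec))
        if norm > 0:
            vec = [x / norm for x in vec]  # normalise so dot products act like cosines
        vectors.append(vec)
    return vectors


vectors = toy_texts_to_vectors(["I often feel anxious", "Feeling nervous, anxious or afraid"])
print(len(vectors), len(vectors[0]))
```

The function returns nested lists; wrapping the result in np.asarray would give the same kind of array the OpenAI example returns.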

💻 Do you want to run Harmony in your browser locally?

Download and install Docker:

Open a Terminal and run

docker run -p 8000:8000 -p 3000:3000 harmonydata/harmonylocal

Then go to http://localhost:3000 in your browser.

Looking for the Harmony API?

Visit: https://github.com/harmonydata/harmonyapi

Docker images

If you are a Docker user, you can run Harmony from a pre-built Docker image.

Contributing to Harmony

If you'd like to contribute to this project, you can contact us at https://harmonydata.ac.uk/ or make a pull request on our Github repository. You can also raise an issue.

Developing Harmony

🧪 Automated tests

Test code is in the tests/ folder and uses unittest.

The testing tool tox is used in the automation with GitHub Actions CI/CD. Since the PDF extraction also needs Java and Tika installed, you cannot run the unit tests without first installing Java and Tika. See above for instructions.

🧪 Use tox locally

Install tox and run it:

pip install tox
tox

In our configuration, tox runs a check of the source distribution using check-manifest (which requires your repo to be git-initialized with git init and its files at least staged with git add .), setuptools's check, and unit tests using pytest. You don't need to install check-manifest and pytest yourself; tox will install them in a separate environment.

The automated tests are run against several Python versions, but on your machine you might be using only one version of Python. If that is Python 3.9, then run:

tox -e py39
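
For other interpreters, the environment name follows tox's usual pyXY pattern. A small sketch that derives it from the running Python is shown below; `tox_env_for` is a hypothetical helper, and it assumes the project's tox configuration declares an environment for that version.

```python
import sys


def tox_env_for(version_info=sys.version_info):
    """Build a tox environment name such as 'py39' from a version tuple."""
    major, minor = version_info[0], version_info[1]
    return f"py{major}{minor}"


# Print the command for the interpreter running this script.
print(f"tox -e {tox_env_for()}")
```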

Thanks to GitHub Actions' automated process, you don't need to generate distribution files locally.

⚙️Continuous integration/deployment to PyPI

This package is based on the template https://pypi.org/project/example-pypi-package/

This package

  • uses GitHub Actions for both testing and publishing
  • is tested when pushing to the master or main branch, and is published when a release is created
  • includes test files in the source distribution
  • uses setup.cfg for version single-sourcing (setuptools 46.4.0+)

⚙️Re-releasing the package manually

The code to re-release Harmony on PyPI is as follows:

source activate py311
pip install twine
rm -rf dist
python setup.py sdist
twine upload dist/*

‎😃💁 Who worked on Harmony?

Harmony is a collaboration project between Ulster University, University College London, the Universidade Federal de Santa Maria, and Fast Data Science. Harmony is funded by Wellcome as part of the Wellcome Data Prize in Mental Health.

The core team at Harmony is made up of:

📜 License

MIT License. Copyright (c) 2023 Ulster University (https://www.ulster.ac.uk)

📜 How do I cite Harmony?

McElroy, E., Moltrecht, B., Ploubidis, G.B., Scopel Hoffman, M., Wood, T.A., Harmony [Computer software], Version 1.0, accessed at https://harmonydata.ac.uk/app. Ulster University (2023)

harmony's People

Contributors

0x48piraj, evewcheng, olikelly00, ollylucl, olp-cs, shahid-0, sourface94, woodthom2, zaironjacobs


harmony's Issues

Allow user to process data

Description

At the end of the Harmony user journey when the user exports results as xlsx ( https://youtu.be/CqAsrY74zNM ), can we generate either some Python code or a Colab or Jupyter notebook allowing them to analyse their datasets?

Rationale

This would be a nice feature to streamline the whole data harmonisation process.

We could offer the option to do it in the Web UI but by helping the user complete it on their machine we bypass some confidentiality issues (if user is not allowed to upload raw data to internet)

An open question: how can the user do a statistical analysis and incorporate the Harmony scores? Do we make a new variable which is e.g. 0.65 × Instr1Ques1 + 0.33 × Instr2Ques5 etc... and then do statistical tests on it???
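
One literal reading of the formula in the question above is a weighted sum of item responses. The sketch below shows only that arithmetic. The weights come from the figures in the question, the responses are invented for illustration, and whether such a composite is statistically sound is exactly what remains open.

```python
def weighted_composite(responses, weights):
    """Sum weight * response over the items named in `weights`."""
    return sum(weight * responses[item] for item, weight in weights.items())


# Hypothetical weights, taken from the example figures in the question.
weights = {"Instr1Ques1": 0.65, "Instr2Ques5": 0.33}
# Hypothetical responses for one participant.
participant = {"Instr1Ques1": 3, "Instr2Ques5": 2}

print(weighted_composite(participant, weights))
```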

“marital status” and “mother status” not detected

Hi Thomas,

I just noticed an issue with something. Somehow when I try to upload the attached files Harmony doesn’t detect the two items on “marital status” and “mother status”. If there is a quick fix before the pitch tomorrow it would be great to do it; if not, I’ll think of something to hide it.

files are:
MCS items english.zip
MCS items english.docx

https://mail.google.com/mail/u/0/#inbox/FMfcgzGwHLfLhGlSCslRsbpVFHGcKCxD

Thank you.

Bettina

ERROR: Could not build wheels for thinc, which is required to install pyproject.toml-based projects

Description

After cloning the repo, I created the Python environment. When I then tried to install the libraries using pip install -r requirements.txt and pip install . I got the error below:

error: command '/usr/bin/gcc' failed with exit code 1
[end of output]
  
note: This error originates from a subprocess, and is likely not a problem with pip.
ERROR: Failed building wheel for thinc
Failed to build thinc
ERROR: Could not build wheels for thinc, which is required to install pyproject.toml-based projects
[end of output]

Environment

OS: Ubuntu 23.04
Python: 3.12.2

Harmony should remove digits if every question starts with a digit

Description

If I upload a CSV file like this, Harmony puts digits at the start of each question

1 I feel nervous
2 I feel afraid

Environment

Web Harmony

How to Reproduce

Make file harmony.csv with content

1 I feel nervous
2 I feel afraid

Upload to web UI

You will see digits at the start of all questions

image

Expected Behavior

Digits should be removed
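
One way the expected behaviour could be sketched is below: strip a leading item number only when every question has one, so that a question which merely happens to start with a number is left alone in a mixed list. `strip_shared_numbering` is a hypothetical helper, not Harmony's implementation, and the pattern (digits, an optional "." or ")", then whitespace) is an assumption about what counts as numbering.

```python
import re

# Leading digits, optionally followed by "." or ")", then whitespace.
_LEADING_NUMBER = re.compile(r"^\s*\d+[\.\)]?\s+")


def strip_shared_numbering(questions):
    """Remove leading item numbers only when every question has one."""
    if questions and all(_LEADING_NUMBER.match(q) for q in questions):
        return [_LEADING_NUMBER.sub("", q, count=1) for q in questions]
    return list(questions)


print(strip_shared_numbering(["1 I feel nervous", "2 I feel afraid"]))
# → ['I feel nervous', 'I feel afraid']
```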

Doesn't extract data from word file

Description

When I upload a Word document with survey items, they are not all extracted by Harmony. In the attached file Harmony doesn't read "legal marital status".
MCS all items english.docx

Environment

Provide details regarding the operating system, toolchain, and environment.

How to Reproduce

Expected Behavior

Harmony reads all items on the list

Required python version/bound is not mentioned

Description

When I tried to install the library I ran into #24. I resolved that issue, but it led me to this one: we don't mention the required (min and max) version of Python in setup.py/pyproject.toml.

Allow loading HTML file format

Description

Allow uploading a .html file with load_instruments_from_local_file, it should then remove all HTML tags etc. and create the instruments.

Rationale

Sometimes people have an instrument in HTML format.
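
The tag-removal step could be sketched with the standard library alone, as below. `html_to_text_lines` is a hypothetical helper that returns the visible text chunks of an HTML document and skips script and style contents. How those lines would then be turned into instruments is left open, since that depends on Harmony's internals.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self.lines = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.lines.append(text)


def html_to_text_lines(html):
    """Return the non-empty text chunks found outside script/style elements."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.lines


sample = "<ol><li>I feel nervous</li><li>I feel afraid</li></ol>"
print(html_to_text_lines(sample))
```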

Don't match to MHC items if similarity is too low

From Bettina:

I just bumped into new Harmony feature and wanted to flag the below:

They are definitely different sentences… (lost my key, found my car). Is it because it's not mental health related? I think with the new feature we need to rethink the linking to the catalogue, as the link to eating disorders doesn’t make sense. Is there a way to activate/deactivate when there is no overlap between uploaded items and catalogue?

Remove empty items from MHC

From John: Thomas, I think we might need a clean-up of the MHC data. There shouldn’t be a question in there with no text, really?

Integrate new non-spacy Pdf parsing into main Harmony

Description

We have a draft improvement to the PDF parsing logic. This will enable us to eliminate spaCy as a dependency.

The training code is here:
https://github.com/harmonydata/pdf-text-models-amol

The API modification is here
https://github.com/harmonydata/harmonyapi branch nospacy

The modification to the main python library is in

git clone -b updated_files_for_forntend https://github.com/Notysoty/harmony.git 

Please quality control this branch and then merge it into main in all repositories and remove spacy from all requirements.txt and toml files.

Rationale

PDF extraction needs improvement

Give a similarity function between questionnaires

Description

Eve Cheng has done some experiments with the Word Movers Distance algorithm which gives the distance between two sequences of sentence embeddings.

Can Harmony use this to say that the GAD-7 is e.g. 60% similar to the PHQ-9?

See Colab notebook:
https://github.com/harmonydata/experiments/blob/main/harmony_wmd_experiment.ipynb

We also have a demo of Harmony integrated with external data sources: https://harmonycataloguelookup.azurewebsites.net/

Source code is at: https://github.com/harmonydata/harmony_catalogue_lookup_dash

See mockup

https://github.com/harmonydata/hackathon/blob/main/instrument_level.png

Rationale

The use case would be:

as a research psychologist, I’ve got one small study here, one small study there. Individually they don’t give enough statistical power, but can they do it together? So can we combine Study A and Study B to get enough statistical power for my research question?

Word Movers Distance is a candidate but it's not necessarily how we solve it. It might be too slow for example.

Maybe a simple solution is just to have a threshold and we report the number of questions in Instrument A matching questions in Instrument B at that threshold.
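
The threshold idea can be sketched as follows. It assumes a cross-similarity matrix whose rows are the questions of Instrument A and whose columns are the questions of Instrument B. The function names, the toy scores and the 0.7 threshold are all hypothetical.

```python
def count_matches_at_threshold(cross_similarity, threshold=0.7):
    """Count questions in A (rows) with at least one question in B (columns) >= threshold."""
    return sum(
        1 for row in cross_similarity if any(score >= threshold for score in row)
    )


def instrument_overlap(cross_similarity, threshold=0.7):
    """Fraction of questions in A that have a match in B at the threshold."""
    if not cross_similarity:
        return 0.0
    return count_matches_at_threshold(cross_similarity, threshold) / len(cross_similarity)


# Toy scores: 4 questions in A (rows) against 2 questions in B (columns).
toy_cross = [
    [0.90, 0.20],
    [0.30, 0.75],
    [0.10, 0.40],
    [0.50, 0.60],
]
print(count_matches_at_threshold(toy_cross), instrument_overlap(toy_cross))
```

Note that this measure is asymmetric: the share of A's questions matched in B need not equal the share of B's questions matched in A.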

Load instrument from URL

Description

User finds an instrument on the internet in PDF, HTML or other format. Can we allow them to paste the URL into Harmony?

Rationale

Convenient way to ingest instruments into the tool
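
Until such a feature exists, one workaround might be to download the file first and hand the local path to load_instruments_from_local_file. The sketch below covers only the download step using the standard library. `download_to_temp_file` is a hypothetical helper, and the URL in the usage comment is a placeholder.

```python
import os
import shutil
import tempfile
import urllib.request
from urllib.parse import urlparse


def download_to_temp_file(url):
    """Download `url` to a temporary file, keeping the URL's file extension."""
    suffix = os.path.splitext(urlparse(url).path)[1]  # e.g. ".pdf"
    with urllib.request.urlopen(url) as response:
        with tempfile.NamedTemporaryFile(delete=False, suffix=suffix) as out:
            shutil.copyfileobj(response, out)
            return out.name


# Usage (hypothetical URL; requires Harmony to be installed):
# path = download_to_temp_file("https://example.org/gad-7.pdf")
# instruments = harmony.load_instruments_from_local_file(path)
```

Keeping the extension matters if the loader chooses its parser from the file name, which is an assumption here. The caller is responsible for deleting the temporary file afterwards.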
