This project forked from mauro-nievoff/multicare_dataset

Repo about the MultiCaRe Dataset, with demo notebooks and details about how it was created.

License: Creative Commons Zero v1.0 Universal

๐Ÿฅ MultiCaRe - Customized Medical Dataset Creation

The MultiCaRe Dataset is a dataset of clinical cases, images, image labels and captions from open access case reports from PubMed Central. Some facts about it:

  • It contains data from over 75K open-access, de-identified case reports, totaling almost 100K clinical cases and more than 135K images
  • Almost 100K patients and 400K medical doctors and researchers were involved in the creation of the articles included in the dataset (see metadata.parquet for citations)
  • The dataset contains images and cases from different medical specialties, such as oncology, cardiology, surgery and pathology

For a detailed description of the contents of this dataset, please refer to the data article published in Data in Brief (see the citation at the end of this document).

You can find further details about how the dataset was created by reading the notebooks from the Dataset_Creation_Process folder.

✅ Create Your Own Dataset

While you may find the whole dataset useful, you probably only need a subset of it based on your specific use case. In order to create a specific subset, first you need to clone this repository and import the MedicalDatasetCreator class:

!git clone https://github.com/mauro-nievoff/MultiCaRe_Dataset

from MultiCaRe_Dataset.multicare import MedicalDatasetCreator

Then, instantiate the MedicalDatasetCreator class. This takes 5 to 10 minutes, as it imports all the MultiCaRe files from Zenodo. The MultiCaRe dataset and any dataset that you create from it will be stored in the same directory (medical_datasets in this example).

mdc = MedicalDatasetCreator(directory = 'medical_datasets')

Now it's time to define what specific dataset we want by creating filters. In the example below, we are including data that meet all these criteria:

  • male patients who are at least 18 years old
  • cases whose text contains any of the words 'tumor', 'cancer' or 'carcinoma'
  • cases with images that have both 'mri' and 'brain' as normalized extractions, and captions that contain any of the words 'metastasis', 'tumor' or 'mass'

filters = [{'field': 'min_age', 'string_list': ['18']},
           {'field': 'gender', 'string_list': ['Male']},
           {'field': 'case_strings', 'string_list': ['tumor', 'cancer', 'carcinoma'], 'operator': 'any'},
           {'field': 'caption', 'string_list': ['metastasis', 'tumor', 'mass'], 'operator': 'any'},
           {'field': 'normalized_extractions', 'string_list': ['mri', 'brain']}]
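
The difference between filters with 'operator': 'any' and filters without an operator can be pictured with a small sketch. This is not the code used by MedicalDatasetCreator: the matches function is hypothetical, and it assumes that an omitted operator means "all strings must be present" (as the 'mri' and 'brain' criterion above suggests) and that matching is a case-insensitive substring check. The library's actual matching rules may differ.

```python
def matches(text, string_list, operator="all"):
    """Hypothetical helper: check `text` against `string_list`.

    operator="any": at least one string must appear in the text.
    operator="all": every string must appear (assumed default).
    """
    text = text.lower()
    hits = [s.lower() in text for s in string_list]
    return any(hits) if operator == "any" else all(hits)


# Toy caption written for this illustration (not taken from the dataset).
caption = "Brain MRI showing a cystic mass in the right cerebellum"

any_match = matches(caption, ["metastasis", "tumor", "mass"], operator="any")
all_match = matches(caption, ["metastasis", "tumor", "mass"])
labels_match = matches(caption, ["mri", "brain"])

print(any_match, all_match, labels_match)
```

With 'any', one matching word ('mass') is enough; under the assumed "all" behavior, the same list fails because 'metastasis' and 'tumor' are absent.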

Finally, let's create the dataset using these filters and selecting the type of dataset that we want (it can be multimodal, text, image or case_series). This step should not take longer than 2 minutes. You can create as many datasets as you want by using the same mdc instance (you just need to change the name of the dataset and the filters).

mdc.create_dataset(dataset_name = 'male_brain_tumor_dataset', filter_list = filters, dataset_type = 'multimodal')
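
Since the same mdc instance can create several datasets, it may be convenient to generate the filter lists programmatically. The sketch below is only a suggestion: build_filters is a hypothetical helper (it is not part of the MultiCaRe code), and the second dataset name and its filters are made up for illustration. It produces lists in the same format as the filters shown above; the create_dataset call is left as a comment because it needs a real mdc instance.

```python
def build_filters(min_age=None, gender=None, case_terms=None,
                  caption_terms=None, labels=None):
    """Hypothetical helper: build a filter list in the format shown above."""
    filters = []
    if min_age is not None:
        filters.append({'field': 'min_age', 'string_list': [str(min_age)]})
    if gender is not None:
        filters.append({'field': 'gender', 'string_list': [gender]})
    if case_terms:
        filters.append({'field': 'case_strings',
                        'string_list': list(case_terms), 'operator': 'any'})
    if caption_terms:
        filters.append({'field': 'caption',
                        'string_list': list(caption_terms), 'operator': 'any'})
    if labels:
        filters.append({'field': 'normalized_extractions',
                        'string_list': list(labels)})
    return filters


# One entry per dataset to create; the second one is a made-up example.
datasets = {
    'male_brain_tumor_dataset': build_filters(
        min_age=18, gender='Male',
        case_terms=['tumor', 'cancer', 'carcinoma'],
        caption_terms=['metastasis', 'tumor', 'mass'],
        labels=['mri', 'brain']),
    'brain_mri_dataset': build_filters(labels=['mri', 'brain']),
}

for name, filter_list in datasets.items():
    print(name, len(filter_list))
    # With a real instance:
    # mdc.create_dataset(dataset_name=name, filter_list=filter_list,
    #                    dataset_type='multimodal')
```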

Done! The dataset is ready to use now.

  • If you want to see a more detailed demo, please refer to this notebook.
  • If you want to create a case series dataset without coding, please refer to this other notebook.

๐Ÿ” First Approach to the Data

Let's see how many cases and images were included:

print(f"Amount of patients: {len(mdc.filtered_cases)}")
print(f"Amount of images: {len(mdc.filtered_image_metadata_df)}")
Amount of patients: 10243
Amount of images: 352
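
Beyond these totals, a tally of image labels can give a quick feel for what the subset contains. The following is a minimal sketch, assuming each entry of the normalized_extractions column is a Python list of label strings (if the labels are stored as strings, they would need to be parsed first). The count_labels function is hypothetical, and the second label list is toy data made up for illustration.

```python
from collections import Counter


def count_labels(label_lists):
    """Hypothetical helper: count how often each label appears."""
    counter = Counter()
    for labels in label_lists:
        counter.update(labels)
    return counter


# Toy input; with a real instance one could pass
# mdc.filtered_image_metadata_df['normalized_extractions'] instead.
example_labels = [
    ['mri', 'contrast', 'pathological_finding', 'brain', 'right', 't1', 'sagittal'],
    ['mri', 'brain', 't2', 'axial'],
]

label_counts = count_labels(example_labels)
print(label_counts.most_common(3))
```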

Nice! Now let's take a look at one example image and its corresponding clinical case. We will create a specific function for this purpose.

from IPython.display import Image, Markdown, display

def display_example(mdc, image_index):
  # Look up the image file and its labels in the filtered image metadata.
  image_path = mdc.filtered_image_metadata_df['file_path'][image_index]
  image_labels = mdc.filtered_image_metadata_df['normalized_extractions'][image_index]

  # Find the clinical case that the image belongs to.
  case_id = mdc.filtered_image_metadata_df['case_id'][image_index]
  for c in mdc.filtered_cases:
    if c['case_id'] == case_id:
      gender = c['gender']
      age = c['age']
      case_text = c['case_text']
      break

  # The PMCID is the part of the case ID before the underscore.
  pmcid = case_id.split('_')[0]
  for r in mdc.reference_list:
    if r['pmcid'] == pmcid:
      citation = r
      break

  # Render the case, the image, its labels and the citation in the notebook.
  display(Markdown(f"**Case {case_id}:**"))
  display(f"Gender: {gender}")
  display(f"Age: {age}")
  display("Clinical Case:")
  display(case_text)
  display(Markdown("**Image:**"))
  display(Image(image_path))
  display(Markdown("**Image Labels:**"))
  display(Markdown(f"{image_labels}"))
  display(Markdown("**Citation Information:**"))
  display(citation)

This is what the first case of the dataset looks like:

display_example(mdc, image_index = 0)

Case_ID: PMC10018421_01

Gender: Male

Age: 32

Clinical Case: A 32-year-old male presented with a history of intermittent headache for 5 months followed by progressive gait disturbances and blurry vision. His medical and family history was insignificant. His physical examination showed ataxia and cerebellar signs including dysmetria and dysdiadochokinesia in the right upper and lower limbs. Decreased visual acuity was noted in both eyes with bilateral papilledema on ophthalmologic exam. The rest of the examination was otherwise normal. Brain MRI revealed a lesion in the right cerebellar hemisphere. The patient underwent a suboccipital craniotomy. During the procedure, frozen sections were misinterpreted as high-grade malignant glioma. The neurosurgeon decided to proceed with subtotal resection because the risks of gross total resection (GTR) outweighed the benefits considering the aggressive nature of the suspected tumor. Adjuvant therapy with a combination of radiotherapy and chemotherapy with temozolomide was initiated. Two years after surgery, the patient complained of reemergence of symptoms including gait disturbance and morning headaches suggestive of increased intracranial pressure. Brain MRI showed a 4.4 x 4.0 cm ill-defined lesion in the right cerebellum with a mixed cystic-solid pattern (shown in Fig. 1). Cystic components of the tumor were hypointense on T1- and hyperintense on T2-weighted images, whereas the solid components of the tumor were hypointense or isointense on T1- and slightly hyperintense on T2-weighted images. Gadolinium-enhanced T1-weighted images showed marked enhancement of the cystic walls. The unusual clinical course and radiologic features raised suspicion for a more benign tumor than high-grade glioma. The patient underwent another surgery to alleviate the symptoms and reassess the residual lesion. During this surgery, the frozen sections suggested the diagnosis of PXA; therefore, the neurosurgeon conducted GTR. 
The histopathological and the immunohistochemical studies confirmed the diagnosis of PXA and eliminated the initial diagnosis as high-grade glioma is not consistent with the histopathological studies' findings. The postoperative follow-up was without complications. The patient was closely monitored thereafter.

Image:

[Image: head_multicare]

Image Labels: ['mri', 'contrast', 'pathological_finding', 'brain', 'right', 't1', 'sagittal']

Citation Information: {'pmcid': 'PMC10018421', 'doi': '10.1159/000529741', 'pmid': '36938309', 'title': 'A Recurrent Pleomorphic Xanthoastrocytoma in the Cerebellum in a Young Adult: A Case Report and Review of the Literature', 'year': '2023', 'authors': ['Ruba Aljendi', 'Mohammed Amr Knifaty', 'Mohammed Amin', 'Souliman Diab', 'Muhammad Saleh Ali', 'Zuheir Alshehabi'], 'journal': 'Case Rep Neurol', 'journal_detail': '2023 Feb 17;15(1):54-62.', 'link': 'https://pubmed.ncbi.nlm.nih.gov/36938309/', 'license': 'CC BY-NC'}
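
Because each citation is a plain dictionary, it can be turned into a readable reference string when crediting the original article. The sketch below is one possible approach, assuming every reference has the keys shown above; format_citation and its output layout are hypothetical and do not follow any particular citation style.

```python
def format_citation(ref, max_authors=3):
    """Hypothetical helper: build a short reference string from a citation dict."""
    authors = ref.get('authors', [])
    if len(authors) > max_authors:
        author_str = ', '.join(authors[:max_authors]) + ', et al.'
    else:
        author_str = ', '.join(authors)
    return (f"{author_str} ({ref['year']}). {ref['title']}. "
            f"{ref['journal']}. https://doi.org/{ref['doi']} [{ref['license']}]")


# The citation dictionary displayed above.
citation = {
    'pmcid': 'PMC10018421', 'doi': '10.1159/000529741', 'pmid': '36938309',
    'title': ('A Recurrent Pleomorphic Xanthoastrocytoma in the Cerebellum in a '
              'Young Adult: A Case Report and Review of the Literature'),
    'year': '2023',
    'authors': ['Ruba Aljendi', 'Mohammed Amr Knifaty', 'Mohammed Amin',
                'Souliman Diab', 'Muhammad Saleh Ali', 'Zuheir Alshehabi'],
    'journal': 'Case Rep Neurol', 'journal_detail': '2023 Feb 17;15(1):54-62.',
    'link': 'https://pubmed.ncbi.nlm.nih.gov/36938309/', 'license': 'CC BY-NC',
}

formatted = format_citation(citation)
print(formatted)
```

Keeping the license in the string is a deliberate choice here, since the articles in the dataset are published under different licenses.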

🤓 How to cite

If you use this dataset, please cite the following:

  • Data Article from Data In Brief:
Nievas Offidani, M. A., & Delrieux, C. A. (2024). Dataset of clinical cases, images, image labels and captions from open access case reports from PubMed Central (1990–2023). In Data in Brief (Vol. 52, p. 110008). Elsevier BV. https://doi.org/10.1016/j.dib.2023.110008
  • Dataset from Zenodo:
Nievas Offidani, M., & Delrieux, C. (2023). The MultiCaRe Dataset: A Multimodal Case Report Dataset with Clinical Cases, Labeled Images and Captions from Open Access PMC Articles (1.0) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.10079370

👋 Final Words

Those are all the basics you need to create a customized subset of the MultiCaRe Dataset. I hope this was useful!

Just a few comments before you leave:

  • Contributions are welcome! If you would like to collaborate on this project, feel free to open pull requests or submit issues.
  • If you find this project useful or interesting, please consider giving it a star ⭐. It's a great way to show your support and helps the project gain visibility.
  • If you have any questions, suggestions, or just want to say hello, feel free to reach out. You can contact me on LinkedIn.

Thank you! And goodbye for now!
