
lidc-idri-preprocessing's Introduction

LIDC Preprocessing with Pylidc library

Medium Link

This repository preprocesses the LIDC-IDRI dataset. We use the pylidc library to save nodule images in .npy file format. The file structure is as below:

+-- LIDC-IDRI
|    # This folder should contain the original LIDC-IDRI dataset
+-- data
|    # This folder contains the preprocessed data
|   |-- _Clean
|       +-- Image
|       +-- Mask
|   |-- Image
|       +-- LIDC-IDRI-0001
|       +-- LIDC-IDRI-0002
|       +-- ...
|   |-- Mask
|       +-- LIDC-IDRI-0001
|       +-- LIDC-IDRI-0002
|       +-- ...
|   |-- Meta
|       +-- meta.csv
+-- figures
|    # Figures are saved here
+-- notebook
|    # This notebook edits the meta.csv file to make indexing easier
+-- config_file_create.py
|    # Creates the configuration file. You can edit the hyperparameters of the pylidc library here
+-- prepare_dataset.py
|    # Run this file to preprocess the LIDC-IDRI DICOM files. Results are saved in the data folder
+-- utils.py
     # Utility script

Segmented Image

1. Download the LIDC-IDRI dataset

First, you have to download the whole LIDC-IDRI dataset. On the website you will see the Data Access section. Click the Search button to filter the images by modality. I selected CT only and downloaded a total of 1010 patients.

2. Set up the pylidc library

You need to set up the pylidc library for preprocessing. There are instructions in its documentation; make sure to create the configuration file as described there. I am currently using library version 0.2.1.
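For reference, the pylidc documentation describes a small configuration file (`C:\Users\[User]\pylidc.conf` on Windows, `~/.pylidcrc` on Mac/Linux) that tells the library where the DICOM data lives; the path below is a placeholder you should replace with your own:

```ini
[dicom]
path = /path/to/datasets/LIDC-IDRI
warn = True
```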

3. Explanation of each Python file

python config_file_create.py

This script contains the configuration settings for the directories. Change the directory settings to wherever you want to save your output files. Without modification, the preprocessed files are saved in the data folder. Running this script creates a configuration file, 'lung.conf'.
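As a minimal sketch of what such a script might write with Python's configparser (the section and option names here are illustrative assumptions, not the repo's exact schema):

```python
import configparser

# Illustrative sketch: write a 'lung.conf' holding directory settings.
# Section and option names are assumptions, not the repo's exact schema.
config = configparser.ConfigParser()
config["prepare_dataset"] = {
    "lidc_dicom_path": "./LIDC-IDRI",
    "image_path": "./data/Image",
    "mask_path": "./data/Mask",
    "clean_path_image": "./data/Clean/Image",
    "clean_path_mask": "./data/Clean/Mask",
    "meta_path": "./data/Meta",
}
with open("lung.conf", "w") as f:
    config.write(f)
```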

python utils.py

This script contains functions to segment the lung. Segmenting the lung and segmenting the nodule are two different things: the former keeps only the lung region, while the latter finds prospective nodule regions within the lung. Don't confuse the two.
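As a rough illustration of the lung-vs-nodule distinction (a sketch under assumptions, not the repo's exact algorithm), lung segmentation can be approximated by HU thresholding plus removal of the border-connected air around the body:

```python
import numpy as np
from scipy import ndimage

def segment_lung_sketch(hu_slice):
    """Rough sketch (an assumption, not this repo's exact algorithm):
    threshold at -400 HU, then drop regions touching the image border
    (the air surrounding the body), keeping internal lung fields."""
    binary = hu_slice < -400
    labels, _ = ndimage.label(binary)
    # Collect every connected-component label that touches the border
    border = set(labels[0, :]) | set(labels[-1, :]) | set(labels[:, 0]) | set(labels[:, -1])
    for b in border:
        binary[labels == b] = False
    return hu_slice * binary  # lung HU values kept, everything else zeroed

# Toy 10x10 slice: soft tissue (0 HU), border air (-1000), a lung region (-800)
toy = np.zeros((10, 10))
toy[0, :] = -1000.0
toy[4:6, 4:6] = -800.0
seg = segment_lung_sketch(toy)
```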

python prepare_dataset.py

This script creates the image and mask files and saves them to the data folder. It also creates a meta.csv file containing information about each nodule, including whether it is cancerous. In the LIDC dataset, each nodule is annotated by up to four radiologists, each of whom rated its malignancy on a scale of 1 to 5. I chose the median high of these ratings as the final malignancy label. The meta.csv data contains all of this information and is used later in the classification stage. prepare_dataset.py looks for the lung.conf file, which must be in the same directory. Running this script outputs .npy files for each slice, with a size of 512×512.
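The median-high reduction can be computed with Python's statistics module; the ratings below are hypothetical:

```python
from statistics import median_high

# Hypothetical malignancy ratings (1-5) from up to four radiologists
ratings = [3, 4, 4, 5]

# median_high returns the larger of the two middle values for an even
# count, so a tie between the middle ratings leans toward "malignant".
label = median_high(ratings)
```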

To make a train/val/test split, run the Jupyter notebook in the notebook folder. This creates an additional clean_meta.csv and a meta.csv containing information about the nodules and the train/val/test split.

A nodule may span several image slices. Some studies have treated each of these slices as independent of one another. However, I believe these slices should not be treated as independent of their adjacent slices. I have therefore kept all slices of the same nodule within the same split. Although this approach lowers the reported test accuracy, it is the more honest approach.
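The idea can be sketched as a nodule-level split, where every slice of a nodule lands in the same fold (the nodule IDs and fractions here are hypothetical):

```python
import random

def split_by_nodule(nodule_ids, val_frac=0.1, test_frac=0.2, seed=42):
    """Assign whole nodules (hence all their slices) to one fold each.
    A sketch under assumptions, not this repo's exact procedure."""
    ids = sorted(set(nodule_ids))
    random.Random(seed).shuffle(ids)
    n_test = int(len(ids) * test_frac)
    n_val = int(len(ids) * val_frac)
    test = set(ids[:n_test])
    val = set(ids[n_test:n_test + n_val])
    train = set(ids[n_test + n_val:])
    return train, val, test

# Hypothetical nodule IDs
train, val, test = split_by_nodule([f"N{i:03d}" for i in range(100)])
```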

4. Data folder

The data folder stores all the output images and masks. Inside it there are four subfolders.

1. Clean

The Clean folder contains two subfolders, Image and Mask. Some patients have no nodules, and in a real-world application a person will have many more slices without a nodule than with one. To evaluate how well we generalize to real-world applications, we save lung images without nodules for testing purposes. These images are used only in the test set.

2. Image

The Image folder contains the segmented lung .npy files, organized into one folder per patient.

3. Mask

The Mask folder contains the nodule mask files, organized into one folder per patient.

4. Meta

The Meta folder contains the meta.csv file. This CSV holds information about each image slice: its malignancy, whether the slice belongs to the train, val, or test split, and so on.

5. Contributing and Acknowledgement

I started this lung cancer detection project a year ago as a real newbie to Python. I didn't even understand what a directory setting was at the time! However, I had to complete this project for personal reasons. I looked through Google and other GitHub repositories, but most of them were too hard to understand and the code itself lacked documentation. I hope the code here can help other researchers starting their first lung cancer detection projects. Please give a star if you found this repository useful.

Here is the GitHub repository I learned a lot from; some of the code is sourced from it:

  1. https://github.com/mikejhuang/LungNoduleDetectionClassification

lidc-idri-preprocessing's People

Contributors

jaeho3690


lidc-idri-preprocessing's Issues

Issue in LIDC-IDRI-Segmentation

The issue that I am raising is regarding LIDC-IDRI-Segmentation project.
Since I was not able to receive a reply there, I am posting here and I apologize for posting query in some other project.

I am unable to find the function crop_nodule in View_output.ipynb.
The function crop_nodule is called in function crop_patch.

Question

Hi, I noticed that your code does not process the HU values, which makes my segmentation model's accuracy rather low. Do you have a solution?
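A common way to address this (an assumption about a typical CT pipeline, not code from this repository) is to apply the DICOM rescale slope/intercept and clip to a lung window before training:

```python
import numpy as np

# Common HU preprocessing (an assumption; not part of this repo's pipeline):
# convert raw pixel values to Hounsfield Units with the DICOM
# RescaleSlope/RescaleIntercept, then clip to a lung window.
def to_hu(pixel_array, slope=1.0, intercept=-1024.0):
    hu = pixel_array.astype(np.float32) * slope + intercept
    return np.clip(hu, -1000.0, 400.0)

hu = to_hu(np.array([[0, 1024, 3000]]))
```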

FileNotFoundError for "Clean" data

Hi Jaeho, my name is Harsha. Thank you for this repository. I am using it to preprocess data for my segmentation model. My issue is that as soon as your script encounters the first "clean" data files, it throws the following error:
Patient ID: LIDC-IDRI-0028 Dicom Shape: (512, 512, 141) Number of Annotated Nodules: 0
Clean Dataset LIDC-IDRI-0028
Traceback (most recent call last):
File "prepare_dataset.py", line 173, in
test.prepare_dataset()
File "prepare_dataset.py", line 157, in prepare_dataset
np.save(patient_clean_dir_mask / mask_name, lung_mask)
File "<array_function internals>", line 6, in save
File "/home/ramanha/env3/lib/python3.5/site-packages/numpy/lib/npyio.py", line 541, in save
fid = open(file, "wb")
FileNotFoundError: [Errno 2] No such file or directory: 'data/Clean/Mask/LIDC-IDRI-0028/LIDC-IDRI-0028/0028_CM001_slice000.npy'

Error running prepare_dataset

Hello, when I run "vol = scan.to_volume()", the program reports an error: "RuntimeError: Could not establish path to dicom files. Have you specified the path option in the configuration file C:\Users\Administrator\pylidc.conf?". I only recently started learning Python and really can't solve it. I hope to get your advice. Thank you very much.

dataset

Do I have to download the entire LIDC dataset?

Error saving clean dataset

prepare_dataset.py gave me a NotADirectoryError while saving the clean dataset.

but I already resolved it by changing these lines (line 152),

nodule_name = "{}/{}_CN001_slice{}".format(pid,pid[-4:],prefix[slice])
mask_name = "{}/{}_CM001_slice{}".format(pid,pid[-4:],prefix[slice])

into,

nodule_name = "{}_CN001_slice{}".format(pid[-4:],prefix[slice])
mask_name = "{}_CM001_slice{}".format(pid[-4:],prefix[slice])

File not found error

I have created all the folders as mentioned but still getting this error

FileNotFoundError Traceback (most recent call last)
in
166
167 test= MakeDataSet(LIDC_IDRI_list,IMAGE_DIR,MASK_DIR,CLEAN_DIR_IMAGE,CLEAN_DIR_MASK,META_DIR,mask_threshold,padding,confidence_level)
--> 168 test.prepare_dataset()

in prepare_dataset(self)
130
131 self.save_meta(meta_list)
--> 132 np.save(patient_image_dir / nodule_name,lung_segmented_np_array)
133 np.save(patient_mask_dir / mask_name,mask[:,:,nodule_slice])
134 else:

<array_function internals> in save(*args, **kwargs)

~\Anaconda3\envs\cpu_env\lib\site-packages\numpy\lib\npyio.py in save(file, arr, allow_pickle, fix_imports)
539 if not file.endswith('.npy'):
540 file = file + '.npy'
--> 541 fid = open(file, "wb")
542 own_fid = True
543

FileNotFoundError: [Errno 2] No such file or directory: 'D:\data\Image\LIDC-IDRI-0001\LIDC-IDRI-0001\0001_NI000_slice000.npy'

ImportError

Good evening friend,

I'm trying to run the prepare_dataset code without success. I keep getting this error:

ImportError: cannot import name 'is_dir_path'

Please, what can I do?

cluster_annotations

Hi, my name is David, I would like to thank you for taking the time to share your knowledge, it was really helpful for understanding such a complex topic.
A question, at the moment I try to run prepare_dataset I got this error message, I am a little bit lost, perhaps could you guide me with this?
0%| | 0/135 [00:00<?, ?it/s]
Traceback (most recent call last):
File "prepare_dataset.py", line 173, in
test.prepare_dataset()
File "prepare_dataset.py", line 99, in prepare_dataset
nodules_annotation = scan.cluster_annotations()
AttributeError: 'NoneType' object has no attribute 'cluster_annotations'

AttributeError: 'NoneType' object has no attribute 'cluster_annotations'

Hi Jaeho! First of all thanks for the detailed explanation for the preprocess. I can follow so far for my first project try on this data.

I encounter a problem after running the code. Can I ask for a help to solve this issue? Thank you very much

AttributeError Traceback (most recent call last)
in
155
156 test= MakeDataSet(LIDC_IDRI_list,IMAGE_DIR,MASK_DIR,CLEAN_DIR_IMAGE,CLEAN_DIR_MASK,META_DIR,mask_threshold,padding,confidence_level)
--> 157 test.prepare_dataset()

in prepare_dataset(self)
81 pid = patient #LIDC-IDRI-0001~
82 scan = pl.query(pl.Scan).filter(pl.Scan.patient_id == pid).first()
---> 83 nodules_annotation = scan.cluster_annotations()
84 vol = scan.to_volume()
85 print("Patient ID: {} Dicom Shape: {} Number of Annotated Nodules: {}".format(pid,vol.shape,len(nodules_annotation)))

AttributeError: 'NoneType' object has no attribute 'cluster_annotations'

ImportError

Thanks for your response.

Utils.ipynb is there. I ran it. I found the function is_dir_path inside also but I keep getting 'cannot import name is_dir_path' error message.

Please do you have any other advice for me?

Thank you

problem in running the code

RuntimeError: Could not establish path to dicom files. Have you specified the path option in the configuration file E:\Users\oqla2\pylidc.conf?

Installation help

Hi,
First of all, thank you so much for the effort put into the guide you wrote up and all this code! I know basically nothing about pre-processing and this is really helping me out.
To start, I wanted to just get your preprocessing code working. I created a new anaconda env and downloaded all the necessary packages. Sadly pylidc wasn't available, so I installed pip in the anaconda env and used pip to install the package.

Now when I run my code, I'm not getting any errors about imports anymore, but I get this error:

AttributeError: 'NoneType' object has no attribute 'cluster_annotations'

For this line:
nodules_annotation = scan.cluster_annotations()

Is this a pylidc installation issue?
Also can you please share exactly how you downloaded the packages (did you install all of them through pip?) so I can try that instead?

Thanks!

'NoneType' object has no attribute 'cluster_annotations'

I am getting error in LIDC-IDRI preprocessing
~\AppData\Local\Temp\ipykernel_11144\2002799093.py in prepare_dataset(self)
140
141 for patient in tqdm(self.IDRI_list):
--> 142 pid = LIDC-IDRI-0x17
143 scan = pl.query(pl.Scan).filter(pl.Scan.patient_id == pid).first()
144 nodules_annotation = scan.cluster_annotations()

if I am writing pid= patient
then getting error
~\AppData\Local\Temp\ipykernel_11144\2591889043.py in prepare_dataset(self)
142 pid = patient
143 scan = pl.query(pl.Scan).filter(pl.Scan.patient_id == pid).first()
--> 144 nodules_annotation = scan.cluster_annotations()
145 vol = scan.to_volume()
146 print("Patient ID: {} Dicom Shape: {} Number of Annotated Nodules: {}".format(pid,vol.shape,len(nodules_annotation)))

AttributeError: 'NoneType' object has no attribute 'cluster_annotations'

How many classes

Hi Jaeho, this is rather a (stupid) question than an issue. What are the number of classes for segmentation from this dataset? 2, right? Malignant or not?

RuntimeError: Could not establish path to dicom files.

hi friend
I am a beginner in Python and new to pylidc. Can you tell me how to specify the path option in the configuration file C:\Users\varun\pylidc.conf?
My LIDC data is available in folder D:\LIDCPREPROCESSING CODE\LIDC-IDRI-Preprocessing-master\LIDC-IDRI.
expecting positive response
chinnu

Generating masks for multiple nodules within the same slice

Hello, your repo has been extremely helpful in the data preprocessing for the LIDC dataset. much thanks.

There is one question that I have which is regarding the nodule masks generation.

For example, if Slice_50 contains 2 nodules, this code will generate 2 npy images for the lung, and 2 npy masks for the nodule right?

The generated npy images for the lung will be the same slice_50, however there will be 2 respective npy masks for each of the nodules within the slice_50.

How will this affect the training and validation accuracies?
