The dicom-attribute-scraper from innolitics

dicom-attribute-scraper's Issues

Set up initial project config placeholders

Build/clean/etc. scripts
Requirements boilerplate
Python version check
Black

Aggregator script

Given multiple JSON mappings from #3, aggregate the results into a single mapping. For example, { tag: "stringA" } and { tag: "stringB" } can be aggregated to a single { tag: ["stringA", "stringB"] }.

This aggregator should accept the same write method arguments as #3. That is, if the scraping script outputs JSON and SQLite, we should make sure the aggregator can aggregate both (I suspect the SQLite aggregator would be quite simple).

The source files are not required to contain identical tags. In other words, the number of example values can vary by tag.
Tags that are not included in any of the source files are not included in the output object. That is, the key does not exist in the output object. This script promises to aggregate the inputs but provides no guarantee about the coverage of those inputs.
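A minimal sketch of the aggregation rule described above, assuming each input has already been loaded as a flat { tag: value } dictionary. The function name aggregate_mappings is hypothetical, and dropping repeated values is an assumption; the issue does not say whether duplicates should be kept.

```python
from collections import defaultdict


def aggregate_mappings(mappings):
    """Merge several {tag: value} mappings into one {tag: [values, ...]} mapping."""
    merged = defaultdict(list)
    for mapping in mappings:
        for tag, value in mapping.items():
            # Assumption: keep each distinct example value once, in first-seen order.
            if value not in merged[tag]:
                merged[tag].append(value)
    # Tags that appear in no input never get a key in the result.
    return dict(merged)
```

Because the result is built only from keys that occur in the inputs, a tag missing from every source file is absent from the output, and tags can end up with different numbers of example values.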

Create and run script to aggregate DICOM files from the Innolitics library

We need to generate the "big" example file. Since we may need to do this again in the future, we should make a little script to handle it. Here are the tasks I expect we will need to complete, but chime in below if you see anything I have missed.

Information gathering:

Ping Yujan to clarify/confirm the file structure and to grant you access to Doomfist. Confirming the structure isn't strictly necessary -- you could write a fully flexible implementation -- but I suspect it will save some time/effort to match the existing structure at least somewhat.

Script:

Find all DICOM files in the example file directories (find may be useful here).
Run each file through the scraper, saving the results in a temporary JSON file.
Aggregate the temporary JSON files using the aggregator.
Clean up any temporary files.
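The four steps above could look roughly like the sketch below. It is written in Python purely for illustration; the issue itself leans toward Make or a shell script. The names find_dicom_files and run_pipeline are hypothetical, the scrape and aggregate arguments stand in for the real scraper and aggregator, and the .dcm suffix is an assumption about how the library names its files.

```python
import json
import tempfile
from pathlib import Path


def find_dicom_files(root):
    """Return every *.dcm file under root (assumes files carry a .dcm suffix)."""
    return sorted(p for p in Path(root).rglob("*.dcm") if p.is_file())


def run_pipeline(root, scrape, aggregate, output_path):
    """Scrape each file to a temporary JSON file, aggregate them, then clean up.

    scrape(path) must return a {tag: value} dict; aggregate(list_of_dicts)
    must return the combined mapping. Both are supplied by the caller.
    """
    with tempfile.TemporaryDirectory() as tmp:
        tmp_dir = Path(tmp)
        for index, dicom_path in enumerate(find_dicom_files(root)):
            # One temporary JSON file per DICOM file; zero-padding keeps them ordered.
            (tmp_dir / f"{index:06d}.json").write_text(json.dumps(scrape(dicom_path)))
        mappings = [json.loads(p.read_text()) for p in sorted(tmp_dir.glob("*.json"))]
        Path(output_path).write_text(json.dumps(aggregate(mappings), indent=2))
    # Leaving the with block deletes the temporary directory and its files.
```

Using a temporary directory means the clean-up step happens even if scraping fails partway through.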

Follow-up:

Retrieve the resulting aggregated file (this may be trivial).
Manually review a few of the Patient fields (name, in particular). I would be very surprised if we had unintentionally stored any files with real patient data, but it's probably worth a 5-minute check before we put the information in front of tens of thousands of views per week. :)
Create PR in dicom standard browser to add the newly generated example file if everything looks good.

I think a makefile or a shell script would be the best choices for format. I would probably lean toward Make but do not have a strong preference one way or the other.

Modify scraper to delete spaces in tags

Tags used in the browser do not include the space after the comma: "(0008,0008)" instead of "(0008, 0008)". Because of this discrepancy, the DICOM browser code does not recognize the scraped tags, so the spaces must be removed during attribute scraping.
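One simple way to do this, as a sketch: strip all whitespace from the tag string before it is used as a key. The helper name normalize_tag is hypothetical.

```python
def normalize_tag(tag):
    """Remove all whitespace from a tag string: "(0008, 0008)" becomes "(0008,0008)"."""
    return "".join(tag.split())
```

The function leaves already-normalized tags unchanged, so it is safe to apply to every tag regardless of where it came from.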

Add SQLite option to scraper and aggregator

When processing a large number of files, it may be more efficient to write to a simple SQLite table instead of creating a JSON file for every DICOM file. For example:

attribute      value
"0010,0020"    "patient-id-12345"
"0010,0040"    "M"
...            ...
"0010,0040"    "O"

This functionality can be enabled by passing an option to the script. E.g., python scraper.py --use-sqlite "sqlite-db-file-name".
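A rough sketch of the two-column table idea using Python's standard sqlite3 module. The table name examples and the function names are hypothetical, and returning only distinct values from the aggregation is an assumption.

```python
import sqlite3


def open_store(db_path):
    """Open (or create) the database and make sure the two-column table exists."""
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS examples (attribute TEXT, value TEXT)")
    return conn


def write_mapping(conn, mapping):
    """Append one row per (attribute, value) pair scraped from a single file."""
    conn.executemany(
        "INSERT INTO examples (attribute, value) VALUES (?, ?)",
        list(mapping.items()),
    )
    conn.commit()


def aggregate(conn):
    """Collect the distinct values recorded for each attribute."""
    result = {}
    rows = conn.execute(
        "SELECT DISTINCT attribute, value FROM examples ORDER BY attribute, value"
    )
    for attribute, value in rows:
        result.setdefault(attribute, []).append(value)
    return result
```

With this layout the scraper only ever appends rows, and aggregation becomes a single query over the table rather than a merge of many files.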

Find example DICOM files

It would be nice to have a few shared example files to use during development. There aren't any hard requirements, but ideally, the files would include a variety of attributes (i.e., overlapping attributes are less useful). The analyzer tab at https://dicom.innolitics.com may be useful for roughly gauging overlap.

Attribute mapping script

Given a DICOM file, create a mapping from tag ("(0010,0010)" or its ID or hex equivalent) to value for each attribute in the file, subject to the criteria below.

Deliverable: Python script that accepts a DICOM file path and saves the result. Ex:

python scriptname.py input-file.dcm --json output-file.json

Include only VRs that can be usefully represented by a string. Ex: exclude Unknown (UN), Other Byte (OB), etc.
Sequence (SQ) elements themselves are ignored; the tags nested inside them still provide examples
MVP: Proprietary tags are ignored
Allow for exclusion of user-specified tags (i.e., if we want to exclude a tag for privacy reasons or because it is malformed in the example file)
Truncate example values at [x] characters by default and allow the user to specify a non-default value
The script defaults to JSON output but supports multiple write methods. For example, it should be easy to write to a SQLite database rather than a JSON file. We can start with JSON to nail the design, but if we are running tens of files, something like SQLite will probably be easier to use in practice.
Automated tests are included. We can write unit tests as needed, but at a minimum, we need a few integration tests on minimal example files.
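The command-line interface implied by the criteria above might be sketched with argparse as follows. The --json and --use-sqlite options come from the issues themselves; --exclude and --max-length are hypothetical names for the exclusion and truncation options, and making the two output options mutually exclusive is an assumption. No truncation default is set because the issue leaves that value open.

```python
import argparse


def build_parser():
    """Build the argument parser for the scraper script."""
    parser = argparse.ArgumentParser(description="Map DICOM tags to example values.")
    parser.add_argument("input", help="path to the DICOM file")
    # Assumption: a run writes to either JSON or SQLite, not both.
    output = parser.add_mutually_exclusive_group()
    output.add_argument("--json", metavar="FILE", help="write the mapping to a JSON file")
    output.add_argument("--use-sqlite", metavar="DB", help="append rows to a SQLite database")
    # Hypothetical option names; may be repeated to exclude several tags.
    parser.add_argument("--exclude", action="append", default=[], metavar="TAG",
                        help="tag to leave out, e.g. for privacy reasons")
    parser.add_argument("--max-length", type=int, default=None,
                        help="truncate example values to this many characters")
    return parser
```

For instance, build_parser().parse_args(["input-file.dcm", "--json", "output-file.json"]) mirrors the invocation shown in the deliverable.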

I suspect https://pydicom.github.io/pydicom/stable/ will be useful.
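As a starting point, here is a minimal sketch of how the filtering criteria might be applied with pydicom. The function names, the exact set of excluded VRs, and the uppercase hex tag format are assumptions, not decisions from the issue. The max_length parameter has no default value because the issue leaves the default unspecified.

```python
# Assumption: treat binary, unknown, and sequence VRs as not string-representable.
EXCLUDED_VRS = {"OB", "OD", "OF", "OL", "OV", "OW", "UN", "SQ"}


def build_mapping(elements, excluded_tags=(), max_length=None):
    """Build {tag: value} from (group, element, vr, value) tuples."""
    mapping = {}
    for group, element, vr, value in elements:
        tag = f"({group:04X},{element:04X})"  # no space after the comma
        if vr in EXCLUDED_VRS:
            continue
        if group % 2 == 1:  # odd group numbers mark private (proprietary) tags
            continue
        if tag in excluded_tags:
            continue
        # Keep the first value seen if a tag repeats (e.g. across sequence items).
        mapping.setdefault(tag, str(value)[:max_length])
    return mapping


def scrape_file(path, **options):
    """Read a DICOM file with pydicom and map its attributes."""
    import pydicom  # imported lazily so build_mapping can be tested without pydicom

    dataset = pydicom.dcmread(path)
    # iterall() also yields the elements nested inside sequences.
    elements = ((e.tag.group, e.tag.element, e.VR, e.value) for e in dataset.iterall())
    return build_mapping(elements, **options)
```

Keeping the filtering in build_mapping, separate from file reading, makes it straightforward to unit test the criteria on hand-written tuples and reserve real files for the integration tests.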
