
broadinstitute / cellprofiler-on-terra


Run CellProfiler on Terra. Contains workflows that enable a full end-to-end Cell Painting pipeline.

License: BSD 3-Clause "New" or "Revised" License

Dockerfile 2.81% WDL 54.82% Python 18.30% Shell 5.40% Jupyter Notebook 18.67%

cellprofiler-on-terra's Introduction

CellProfiler on Terra

WDL workflows and scripts for running a CellProfiler pipeline on Google Cloud hardware. Includes workflows for all steps of a full Cell Painting pipeline.

Works well in Terra, and will also work on any Cromwell server that can run WDLs. Currently specific to a Google Cloud backend. (We are open to supporting more backends, specifically cloud storage locations, in the future, including AWS and Azure.)

You can see these workflows in action and try them yourself in Terra workspace cellpainting!

Three pipelines:

  1. Cell Painting

    • All the workflows necessary to run an end-to-end Cell Painting pipeline, starting with raw images and ending with extracted features, both in database format and aggregated as CSV files.
    • Appropriate for datasets of arbitrary size.
    • Scatters the time-consuming analysis steps over many VMs in parallel. By default, a dataset is split into individual wells, and each well is run on a separate VM.
  2. Cytominer

    • Run the cytominer-database ingest step to create a SQLite database containing all the extracted features.
    • Run the aggregation step from pycytominer to create CSV files. (A minimal sketch of these two steps appears after this list.)
  3. CellProfiler (distributed or single VM)

    • Run a single CellProfiler pipeline (a .cppipe file) on your data, either scattered across many VMs or on a single VM.

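The two Cytominer steps correspond to the cytominer-database CLI and the pycytominer API. Below is a minimal sketch in Python with illustrative paths and file names; exact CLI flags and API arguments may differ between versions.

    import subprocess
    from pycytominer.cyto_utils.cells import SingleCells

    # Ingest the per-well CellProfiler CSV output into a single SQLite backend.
    # "analysis_output/", "backend.sqlite", and "ingest_config.ini" are placeholders.
    subprocess.run(
        ["cytominer-database", "ingest", "analysis_output/",
         "sqlite:///backend.sqlite", "--config-file", "ingest_config.ini"],
        check=True,
    )

    # Aggregate single-cell features into per-well median profiles and write a CSV.
    profiles = SingleCells(
        "sqlite:///backend.sqlite",
        strata=["Metadata_Plate", "Metadata_Well"],
    )
    profiles.aggregate_profiles(output_file="aggregated_profiles.csv")
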
How to run these workflows yourself

These workflows are all publicly available and hosted on Dockstore. From there, you can import and run the workflows in Terra or anywhere else you like to run WDL workflows.

You can clone the Terra workspace cellpainting, which is conveniently preconfigured to run on three plates of sample data, if you just want to give it a try.

cellprofiler-on-terra's People

Contributors

carmendv, deflaux, sjfleming


cellprofiler-on-terra's Issues

Automatic tests on a small dataset using GitHub actions

Continuous integration tests that run all WDLs on real data.

This is a longer-term goal, since it will take some effort to get working. There are examples of repositories that use GitHub Actions to set up a Cromwell instance and test WDLs against it. It would require storing gcloud credentials as secrets and spinning up a Cromwell server during the test, so some trial and error is expected.

This would pair well with the womtool validation in #28.

Consider making use of preemptibles configurable for long running tasks

I was recently testing the cytomining workflow; it was preempted after running for 3 hours, and so it took an additional 3.5 hours to complete the test.

Preemptible VMs are currently hardcoded as the default for the cytomining workflow and the other workflows in this collection. It's a good default! Please also consider making the use of preemptibles optional for long-running workflows such as cytomining and illumination correction. Thanks!

Reorganization of folders in the repo, and documentation

Our current split into a "single VM" pipeline versus a "distributed" pipeline reflects our historical development, but I am not sure it is the best way to explain these pipelines to others.

Proposal:

  • Have one main folder in the base directory called utils: this will have the WDL with the common utility tasks
  • Have one main folder in the base directory called pipelines
    • Have three subfolders in there that are called cellprofiler, cell_painting, and mining
      • cellprofiler: this should still be distributed (scattered), but should just be the simple single task of running one cppipe file on data. (The WDL will call the utility task.)
      • cell_painting: this is the full pipeline for analyzing Cell Painting data, end to end, starting with images and ending with feature aggregation. This is what's currently the "distributed" pipeline. Is it accurate to call it a "Cell Painting pipeline"? (All these WDLs call sub-tasks from the utilities WDL.)
      • mining: it's probably worth keeping this around as a separate "pipeline" because maybe someone will want to use it as a stand-alone step (even if they didn't use the pipelines in this repo for the first part). (The WDL will call a utility task for cytomining: the same task called by the cytomining workflow that's part of the Cell Painting pipeline. No code duplication.)
  • The "single VM" workflow will disappear: if people want to run on one VM for some reason, then there should be an option to not scatter (do we have that currently?) as part of the utility CellProfiler WDL task.

Each of the subfolders (cellprofiler, cell_painting, and mining) will have its own README with documentation.

Any thoughts?

multicloud: check whether the output bucket is writable

As part of one possible approach to enable #41, rename gcloud_is_bucket_writable to is_bucket_writable and add write permissions tests for the other clouds.

For example, logic exists to check both GCS and S3 buckets in https://github.com/broadinstitute/cellprofiler-on-Terra/blob/v0.3.0/pipelines/mining/cytomining_jumpcp.wdl#L138. It could be refactored out of that WDL and that workflow could be updated to instead call the updated task is_bucket_writable.
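
A minimal sketch of what a cloud-agnostic writability probe could look like, assuming the gsutil and aws CLIs are available in the task's Docker image (the function and object names here are illustrative, not the existing task's implementation):

    import subprocess
    import tempfile

    def is_bucket_writable(bucket_url: str) -> bool:
        """Try to copy a small probe file into bucket_url and report success."""
        with tempfile.NamedTemporaryFile(suffix=".txt") as probe:
            probe.write(b"write-permission probe")
            probe.flush()
            target = bucket_url.rstrip("/") + "/permissions_probe.txt"
            if bucket_url.startswith("s3://"):
                cmd = ["aws", "s3", "cp", probe.name, target]
            else:
                cmd = ["gsutil", "cp", probe.name, target]
            return subprocess.run(cmd, capture_output=True).returncode == 0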

Make the WDL imports into local imports

Dockstore can handle local imports well, but I suspect they would be a problem for the Broad Methods Repository.

However, some of our WDLs still point to the Broad Methods Repository, so if we forget to update them, then even the Dockstore version of the workflows will be using old imports from the Methods Repository, and will not be synced with GitHub.

Aggregated tables for non-default segmentations

In the default configuration, Cytomining aggregation is based on cell, cytoplasm, and nuclei.

Is it possible to include additional segmentations in the aggregated table, for example when a different secondary object (such as vacuoles) is identified?

Thanks,
Shams
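
One possible direction is to read the extra compartment's table directly from the backend SQLite and aggregate it with pycytominer. This is only a sketch: the "vacuoles" table name and the metadata columns on the image table are assumptions about how the ingest step named things.

    import pandas as pd
    from sqlalchemy import create_engine
    from pycytominer import aggregate

    engine = create_engine("sqlite:///backend.sqlite")

    # Assumed table/column names: an extra "vacuoles" object table keyed like the
    # default compartments, and plate/well metadata stored on the image table.
    image = pd.read_sql(
        "SELECT TableNumber, ImageNumber, Metadata_Plate, Metadata_Well FROM image", engine
    )
    vacuoles = pd.read_sql("SELECT * FROM vacuoles", engine)
    df = vacuoles.merge(image, on=["TableNumber", "ImageNumber"])

    # Aggregate all non-metadata columns to per-well medians.
    meta_cols = {"TableNumber", "ImageNumber", "ObjectNumber", "Metadata_Plate", "Metadata_Well"}
    features = [c for c in df.columns if c not in meta_cols]
    profiles = aggregate(df, strata=["Metadata_Plate", "Metadata_Well"],
                         features=features, operation="median")
    profiles.to_csv("vacuoles_aggregated.csv", index=False)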

Add a test suite that verifies all WDLs using womtool on each PR

Again, I haven't engineered this kind of thing myself yet, but I've seen it done many times.

For long-term maintainability, and to enable more external contributions, it's nice to have the ability to run automatic tests when a pull request is created. For this repository, the only thing I'd imagine doing is validating the WDLs using womtool. We could imagine a real end-to-end integration test where we actually run all the pipelines on a small test dataset, but this (I think) would require a good deal more engineering.
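
The validation step itself is small. For example, a sketch that sweeps the repository with womtool (assuming womtool.jar has been downloaded, e.g. from the Cromwell releases page) and fails if any WDL does not validate:

    import subprocess
    from pathlib import Path

    failures = []
    for wdl in sorted(Path(".").rglob("*.wdl")):
        result = subprocess.run(
            ["java", "-jar", "womtool.jar", "validate", str(wdl)],
            capture_output=True, text=True,
        )
        if result.returncode != 0:
            failures.append((wdl, result.stderr.strip() or result.stdout.strip()))

    for wdl, err in failures:
        print(f"FAILED {wdl}:\n{err}\n")
    if failures:
        raise SystemExit(1)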

multicloud: update the cytomining workflow to have multiple tasks

As part of one possible approach to enable #41, update cytomining.wdl to have multiple tasks.

Specifically, call task is_bucket_writable (renamed from gcloud_is_bucket_writable) at the beginning of the cytomining workflow and call task extract_and_copy (renamed from extract_and_gsutil_rsync) at the end so that it can take advantage of multi-cloud support.

(Note that cytomining_jumpcp.wdl is a bit more complicated, as it has support to perform federated auth to AWS when running in GCP so that inputs can be read from S3 and outputs can be written to S3.)

Consider adding dependency `crcmod` to Docker images.

I noticed the warning below when using image us.gcr.io/broad-dsde-methods/cytomining:0.0.3 with the cytomining workflow. Consider adding package crcmod to this Docker image, and any other Docker images used by these workflows with calls to gsutil. Installation instructions are shown via gsutil help crcmod.

WARNING: gsutil rsync uses hashes when modification time is not available at
both the source and destination. Your crcmod installation isn't using the
module's C extension, so checksumming will run very slowly. If this is your
first rsync since updating gsutil, this rsync can take significantly longer than
usual. For help installing the extension, please see "gsutil help crcmod".
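
A quick way to confirm whether a given image already has the compiled extension is to try importing it; crcmod builds a C extension module, _crcfunext, when compilation succeeds. A sketch to run inside the image:

    # Check whether the compiled crcmod extension is present in this environment.
    try:
        import crcmod._crcfunext  # built only when the C extension compiles successfully
        print("crcmod C extension available; gsutil checksumming will be fast")
    except ImportError:
        print("pure-Python crcmod only; gsutil rsync/cp checksumming will be slow")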

multicloud: specify recommended data for testing and validation

To start work on #41, the CellProfiler methods team let us know a recommended plate to use from their recent data release.

Next steps:

  • for the test plate BR00125638, locate the corresponding files:
    1. config.yml
    2. plate map
  • create cloud-specific inputs.json files for each of the workflows, using placeholder comments for any parameter values that are not yet available, and send a pull request to add them to the multicloud branch

Consider adding a user-actionable message for assertions that fail.

I passed an incorrect yaml file to create_load_data and it failed with the following message. See https://job-manager.dsde-prod.broadinstitute.org/jobs/c0cd7862-f1ad-49fc-9327-cb4ddbf224ce

File "/scripts/commands.py", line 213, in convert_to_dataframe
assert channel in channels
AssertionError

Suggestion: consider adding a user-actionable message to assertions that fail, such as https://github.com/broadinstitute/cellprofiler-on-Terra/blob/master/cellprofiler_distributed/scripts/commands.py#L213. For example:

assert channel in channels, f'''{channel} from {config_file_path} not found in {channels}.
Correct the list of channels in the config file and try again.'''

CellProfiler ExportToSpreadsheet csv files empty

I set up a pipeline to analyze worm images. I wanted to export the image intensity data to a spreadsheet, so I chose to export it as csv files. However, when I open the csv files, they're empty: no data for std intensity, integrated intensity, etc. The measurements did show up while the pipeline was running. I have attached my pipeline and also the exported spreadsheets. I hope somebody can help me out.
MyExpt_Experiment.csv

cpd_analysis_pipeline has no "outputs"

Should we have an output block in that WDL? It seems fine if people want to ignore the outputs, but we should probably specify some, even if it's just an output directory.

multicloud: update all tasks that use `gsutil` to have an “if” statement for other clouds

As part of one possible approach to enable #41, for all of these tasks:

Rename them and update them to use an if statement to execute the correct CLI for the bucket url.

For example, extract_and_gsutil_rsync could change in the following way to add support for S3:

task extract_and_copy {
    # WARNING: This task can potentially overwrite bucket data
    # if destination_url is not empty.

  input {
    # Input and output files
    File tarball
    String destination_url
  }

  command <<<
    set -o errexit
    set -o pipefail
    set -o nounset
    set -o xtrace

    # untar the files
    mkdir sync_files
    tar -xvzf ~{tarball} -C sync_files

    if [[ ~{destination_url} == "s3://"* ]]; then
      aws s3 cp --acl bucket-owner-full-control --recursive sync_files/ ~{destination_url}/
    else
      gsutil -m cp -r sync_files/* ~{destination_url}/
    fi
  >>>

  output {
    String output_directory = destination_url
  }
}

cytomining: allow sqlite as input

Implement an improvement to the cytomining workflow that allows a sqlite file as an input: if it doesn't exist, it should be created; if it does exist, load the existing one.
This would allow creating the profiles (aggregation) using a different aggregation method (mean vs. median) and different normalization settings, without having to re-run the sqlite ingest, which takes several hours.
Bonus improvement: allow additional inputs for the aggregation, normalize, and SingleCells steps to expand functionality.
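
A minimal sketch of the reuse check, assuming gsutil is on the PATH; the helper name and paths are illustrative:

    import subprocess

    def fetch_existing_backend(sqlite_url: str, local_path: str = "backend.sqlite") -> bool:
        """Download an existing backend sqlite from the bucket if present.

        Returns True when a backend was found (so the ingest step can be skipped)
        and False when the caller needs to run ingest and upload the new file.
        """
        exists = subprocess.run(
            ["gsutil", "-q", "stat", sqlite_url], capture_output=True
        ).returncode == 0
        if exists:
            subprocess.run(["gsutil", "cp", sqlite_url, local_path], check=True)
        return exists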

Fail fast if permissions are not correct for output paths

I created a data table for someone else to use, with all the parameters needed for the cytomining workflow. When I tested it logged in as a separate user, the job failed after 4 hours because the data table mistakenly had a GCS path in parameter output_directory_gsurl for which the separate user did not have WRITE permission.

Consider proactively checking that all output paths are writable before beginning computation, and failing fast when the permissions are incorrect. (If I recall correctly, Cromwell performs this check for inputs and outputs from GCS, but this suite of workflows is not using Cromwell for file delocalization.) Thank you!

Create `pngs` of all resulting illumination correction `*npy` files.

I'd like to send a pull request with this change. Please let me know if that sounds good and/or if you have a different suggestion?

I'm planning to add some code right after the call to cellprofiler to determine whether any *.npy files occur in the output subdirectory. If so, it will run some Python matplotlib code to create a png for each npy. The pngs will get bundled up with the other output files and be available for viewing by users, but should be ignored by all downstream workflows.

Alternatively, I could write the png files to a second, additional Cromwell output, when they exist.
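
A sketch of the conversion step, assuming matplotlib and numpy are available in the image; the output directory name is a placeholder:

    from pathlib import Path

    import matplotlib
    matplotlib.use("Agg")  # headless rendering inside the container
    import matplotlib.pyplot as plt
    import numpy as np

    # Render a PNG preview next to every illumination-correction .npy file.
    output_dir = Path("output")
    for npy_file in sorted(output_dir.glob("*.npy")):
        image = np.load(npy_file)
        plt.figure(figsize=(6, 6))
        plt.imshow(image, cmap="gray")
        plt.colorbar(shrink=0.8)
        plt.title(npy_file.name)
        plt.savefig(npy_file.with_suffix(".png"), dpi=150, bbox_inches="tight")
        plt.close()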

Consider being agnostic to the backend (currently google cloud)

(Only if people actually want / need this. But I assume some people might. I think the Imaging Platform stores a lot of data on AWS.)

Supposedly Terra will be supporting multiple backends (GCP, AWS, Azure) in the near future. All of our "gsutil" commands (which kind of break the usual WDL logic) only work on GCP.

We should think about whether we can do everything strictly in WDL, without any gsutil commands. Or whether we can have separate sorts of "cloud file copying" commands for separate backends, calling the right ones where appropriate.

Tolerate cppipe files that specify the path to `load_data.csv`.

I tried to run cp_illumination_pipeline but it failed. See https://job-manager.dsde-prod.broadinstitute.org/jobs/368c6a51-e84c-453a-b242-88cff962b6ea

From the log, the load_data.csv file I created was transferred properly, but for some reason CellProfiler was passed an incorrect default path to the file. Per @sjfleming this was because the cppipe file I used had the following lines:

    Input data file location:Default Input Folder sub-folder|Downloads
    Name of the file:load_data (1).csv

We confirmed that adding the --data-file parameter to https://github.com/broadinstitute/cellprofiler-on-Terra/blob/master/cellprofiler_distributed/cellprofiler_distributed_utils.wdl#L472 fixes the issue:

    cellprofiler --run --run-headless \
      -p ~{cppipe_file}  \
      --data-file=~{load_data_csv} \
      -o output \
      -i $csv_dir
