
rsp-old's People

Contributors

cytronicoder, redocinortyc

rsp-old's Issues

Redundant 360° Scanning in RSP Implementation

There's a redundancy issue in our Radar Scanning Plot (RSP) implementation where the scan erroneously extends to a full 360°. This causes an unnecessary scan of the starting position, effectively duplicating the data point at 0° and 360°, which could lead to misinterpretation of the plot or skewing of any statistical analysis.
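
One common way to avoid the duplicate endpoint is to generate the scan angles with the upper bound excluded. A minimal sketch with NumPy (illustrative only; the actual angle generation in the repo may differ):

    import numpy as np

    n_steps = 360  # one scan step per degree
    # endpoint=False stops one step short of 2*pi, so the starting
    # position at 0 degrees is not scanned again at 360 degrees.
    theta = np.linspace(0, 2 * np.pi, n_steps, endpoint=False)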

Build RSP analysis app using Dash

Self-explanatory issue: both the simulation app and the main app should work smoothly.

  • Make sure that proper documentation is provided on how to use the apps.
  • PR should only be accepted after #12 is closed.
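
A minimal Dash skeleton to start from (a sketch assuming Dash 2.x; the layout and the placeholder polar figure are illustrative, not the planned app):

    import dash
    from dash import dcc, html
    import plotly.graph_objects as go

    app = dash.Dash(__name__)

    # Placeholder polar trace standing in for an RSP plot
    fig = go.Figure(go.Scatterpolar(r=[1, 2, 1.5], theta=[0, 120, 240], mode="lines"))

    app.layout = html.Div([
        html.H1("RSP Analysis"),
        dcc.Graph(id="rsp-plot", figure=fig),
    ])

    if __name__ == "__main__":
        app.run_server(debug=True)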

Documentation Needed using Sphinx

  1. Set up Sphinx in the repo.
  2. Document every function with descriptions, parameters, return values, and exceptions.
  3. Provide an overview and detailed descriptions for each file.
  4. Configure Sphinx to auto-generate documentation from docstrings.
  5. (Optional) Integrate the documentation with Read the Docs for hosting.
  6. Review all documentation for consistency and clarity.
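
A minimal autodoc-oriented configuration sketch (the paths, project name, and theme below are assumptions about the repo layout):

    # docs/conf.py
    import os
    import sys

    sys.path.insert(0, os.path.abspath(".."))  # make the package importable for autodoc

    project = "rsp"  # placeholder project name
    extensions = [
        "sphinx.ext.autodoc",   # pull documentation from docstrings
        "sphinx.ext.napoleon",  # support Google/NumPy style docstrings
        "sphinx.ext.viewcode",  # link rendered docs to highlighted source
    ]
    html_theme = "sphinx_rtd_theme"  # convenient if hosting on Read the Docs (step 5)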

Verify correctness of generated `master_file.csv`

Description

We've generated a CSV file named master_file.csv, which contains gene-related data, including gene name, coverage, mean expression, total expression, and RSP area. Before proceeding with further analyses or sharing the dataset, we must ensure this CSV is correct and contains accurate data.

Steps to reproduce

  1. Clone the repository and navigate to the relevant directory
  2. Check the file master_file.csv
  3. Use any relevant scripts or methods to regenerate the CSV
  4. Compare the regenerated CSV with the provided master_file.csv

Expected behavior

The CSV should accurately represent the gene data as per our analysis scripts. All entries should match expectations, and there should be no missing or erroneous values.

Tasks

  • Review the structure and headers of the CSV to ensure they match expectations.
  • Check for any missing or NaN values.
  • Compare the new CSV with the previous one generated with the old program.
  • Verify that expression and area values are within expected ranges.
  • Confirm that coverage values lie between 0% and 100%.
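
Most of these tasks can be covered by a quick pandas pass; a sketch (the exact column names below are assumptions about the file's headers):

    import pandas as pd

    df = pd.read_csv("master_file.csv")

    # Structure and headers
    expected = {"gene", "coverage", "mean_expression", "total_expression", "rsp_area"}
    missing = expected - set(df.columns)
    assert not missing, f"missing columns: {missing}"

    # Missing / NaN values
    print(df.isna().sum())

    # Coverage should lie between 0% and 100%
    assert df["coverage"].between(0, 100).all(), "coverage outside [0, 100]"

    # Expression and area values should be finite and non-negative
    for col in ("mean_expression", "total_expression", "rsp_area"):
        assert df[col].notna().all() and (df[col] >= 0).all(), f"unexpected values in {col}"

For the comparison against the file produced by the old program, pandas.DataFrame.compare (or an outer merge on the gene column) can be used to list the differing rows.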

Simulated data

In the first simulation, points were generated by rejection sampling: each candidate point was drawn with coordinates between -1 and 1 using uniform sampling and accepted only if it satisfied the condition x^2 + y^2 ≤ 1. This process was repeated to generate the cluster points. For each simulation, a certain percentage of the points (ranging from 5% to 95%) was randomly chosen and assumed to represent the cells expressing the gene.

Implement a function that generates a specified number of points within the unit circle. A given percentage of these points should be marked as expressing cells, and the distribution of the expressing cells should be either uniform/random (even) or clustered together (biased).
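
A minimal sketch of such a generator (the function name, signature, and the nearest-to-a-seed clustering heuristic are assumptions, not the repo's implementation):

    import numpy as np

    def simulate_cells(n_points, expressing_fraction, biased=False, seed=None):
        """Generate points in the unit circle and mark a fraction as expressing cells."""
        rng = np.random.default_rng(seed)

        # Rejection sampling: draw from [-1, 1]^2 and keep points inside the unit circle
        points = []
        while len(points) < n_points:
            x, y = rng.uniform(-1, 1, size=2)
            if x * x + y * y <= 1:
                points.append((x, y))
        points = np.array(points)

        n_expressing = int(round(expressing_fraction * n_points))
        if biased:
            # Clustered (biased): mark the points nearest to a randomly chosen center
            center = points[rng.integers(n_points)]
            order = np.argsort(np.linalg.norm(points - center, axis=1))
            chosen = order[:n_expressing]
        else:
            # Even (uniform/random): mark a uniformly random subset
            chosen = rng.choice(n_points, size=n_expressing, replace=False)

        is_expressing = np.zeros(n_points, dtype=bool)
        is_expressing[chosen] = True
        return points, is_expressing

For example, simulate_cells(1000, 0.25, biased=True) would mark roughly 25% of the points as a clustered expressing population.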

Support for analyzing additional datasets

The RSP tool has shown promising results with the current neonatal mouse heart dataset. Analyzing additional datasets would further enhance its capabilities and applicability.

Implementation considerations

  • Data Preprocessing: Each dataset might come with its own set of challenges in terms of preprocessing. It would be essential to have a flexible preprocessing pipeline.
  • Scalability: Some of the new datasets might be significantly larger. Ensuring that the RSP tool scales well with larger datasets would be crucial.
  • Documentation: For each new dataset supported, comprehensive documentation detailing the preprocessing steps, potential use cases, and any dataset-specific nuances would benefit users.

Pinned issue for TODOs

Here's an ever-updating list of TODOs I have to look into following my research meeting:

  • Visualize the relationship between the coverage and the RSP plot
  • #19
  • Look at the “Others” category in the PGC rank pie chart and try to classify those genes
  • #20
    • Get percentages of studies per category
  • What are the results?
    • Conclusion 1: RSP plots are more sensitive to coverage (compared to PGC, for example) and can be used to discover shape.
    • Conclusion 2: Look at a specific cluster (indicate cluster type and marker genes) and conduct a comparison analysis between the two methods (PGC and RSP) + ranking analysis
    • Conclusion 3 - Novel: Low-coverage genes with some bias (non-zero threshold for gene expression?) and their respective RSP plots
      • Find gene pairs that qualify as having the same or different shapes - cardiomyocyte-related (different stages)
    • Conclusion 4: Get t-SNE and manually place the center of the RSP plot to compare the genes - scan all of these clusters together
  • Email session chair about my status as a high school scholar working with Professor Chen - wish to speak + network at the Young Scientist Session

Run PAGER, classify by GOA and WikiPathways, and feed into ChatGPT to summarise

To enhance the analysis and summarization capabilities of our gene analysis pipeline, I propose integrating PAGER, classifying its results using Gene Ontology Annotation (GOA) and WikiPathways, and summarizing them with ChatGPT. This feature will help us understand the gene sets better and provide succinct summaries for end users.

Proposed Workflow

  1. PAGER Execution
    • Run PAGER on the identified gene sets to analyze genome-wide gene set relationships.
  2. Classification
    • Classify the PAGER results using Gene Ontology Annotation (GOA) and WikiPathways to categorize the gene sets into biologically meaningful groups.
  3. Summarization with ChatGPT
    • Feed the classified gene sets and their relationships into ChatGPT to generate a user-friendly summary elucidating the significant findings.
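
A sketch of step 3 only, using the openai Python client (the model name, prompt, and function name are placeholders; the PAGER run and the GOA/WikiPathways classification are assumed to have already produced a text or table of classified gene sets):

    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    def summarize_gene_sets(classified_results: str) -> str:
        """Ask ChatGPT for a short, user-friendly summary of classified gene sets."""
        response = client.chat.completions.create(
            model="gpt-4o-mini",  # placeholder model name
            messages=[
                {"role": "system",
                 "content": "You summarize gene set enrichment results for biologists."},
                {"role": "user",
                 "content": f"Summarize the significant findings:\n{classified_results}"},
            ],
        )
        return response.choices[0].message.content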

Work out a CSV with simulated data (different ranges of coverages)

To better test and understand the performance and behavior of our analysis scripts, it would be beneficial to have a CSV file containing simulated data with different ranges of coverage. The file should have a structured format that mimics real data but uses controlled, known values to cover a variety of scenarios we might encounter.
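
One way to lay out such a file, reusing the structure of master_file.csv (the column names and coverage grid are assumptions, and the numeric values here are random placeholders to be replaced by the pipeline's real outputs):

    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(0)
    rows = []
    for coverage in range(5, 100, 5):                 # 5% to 95% in 5% steps
        for distribution in ("even", "biased"):
            rows.append({
                "gene": f"sim_{distribution}_{coverage:02d}",
                "coverage": coverage,
                "distribution": distribution,
                # Placeholder statistics; in practice these come from the RSP pipeline
                "mean_expression": rng.uniform(0, 10),
                "total_expression": rng.uniform(0, 1000),
                "rsp_area": rng.uniform(0, np.pi),
            })

    pd.DataFrame(rows).to_csv("simulated_master_file.csv", index=False)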

Marker gene highlighting and cluster filtering

The current t-SNE plotting function allows a basic visualization of the data clustered with DBSCAN. However, it lacks the ability to highlight specific genes and filter clusters, which can be crucial for more targeted analysis.

Highlight Marker Gene(s)

  • Add the ability to highlight one or more marker genes in the t-SNE plot.
  • Users should be able to specify the color(s) for the marker gene(s) (e.g., red, blue, green).
  • Marker genes should be prominent against the background genes, which should be displayed in a muted color (e.g., grey).

Display Specific Cluster(s)

  • Allow users to specify one or more clusters to be displayed on the t-SNE plot.
  • Other clusters should either be hidden or displayed in a muted color for context.
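
A matplotlib sketch of both features (the function name, argument names, and data layout - a coordinates table with tsne_1, tsne_2, and cluster columns plus a cells-by-genes expression table - are assumptions):

    import matplotlib.pyplot as plt

    def plot_tsne(coords, expression, marker_genes=None, marker_colors=None, clusters=None):
        """t-SNE scatter with optional marker-gene highlighting and cluster filtering.

        coords: DataFrame with tsne_1, tsne_2, and cluster columns (one row per cell).
        expression: DataFrame of expression values, same rows, one column per gene.
        """
        fig, ax = plt.subplots()

        # Cluster filtering: show only the requested clusters, if any
        shown = coords if clusters is None else coords[coords["cluster"].isin(clusters)]

        # Background cells in a muted color
        ax.scatter(shown["tsne_1"], shown["tsne_2"], s=5, c="lightgrey", label="other cells")

        # Highlight cells expressing each requested marker gene
        for i, gene in enumerate(marker_genes or []):
            expressing = shown[expression.loc[shown.index, gene] > 0]
            color = marker_colors[i] if marker_colors and i < len(marker_colors) else None
            ax.scatter(expressing["tsne_1"], expressing["tsne_2"], s=15, c=color, label=gene)

        ax.legend()
        return fig, ax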

Migrate codebase to utilize `scanpy` and `BioPython`

Background

Our current pipeline utilizes a mix of custom implementations and various libraries. I'm considering migrating some parts of the code to specialized libraries such as scanpy for single-cell analysis and BioPython for bioinformatics operations.

Action Items

  1. Code Audit: Identify areas in our codebase that can benefit from scanpy and BioPython.
  2. Refactor & Migrate: Begin the process of refactoring the identified code areas. This would include:
    • Using scanpy for t-SNE generation, clustering, and visualization.
    • Leveraging BioPython for sequence analysis, file format conversions, and other bioinformatics tasks.
  3. Testing: Rigorous testing of the migrated code to ensure accuracy and efficiency.
  4. Documentation: Update documentation to reflect the changes and provide guidelines for using the new libraries.
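
For item 2 above, the t-SNE/clustering portion might reduce to a handful of scanpy calls (a sketch; the input file name is a placeholder, and Leiden clustering is shown in place of the current DBSCAN step):

    import scanpy as sc

    # Load the expression matrix (cells x genes) into an AnnData object
    adata = sc.read_csv("dge_matrix.csv")  # placeholder file name

    # Standard preprocessing
    sc.pp.normalize_total(adata, target_sum=1e4)
    sc.pp.log1p(adata)
    sc.pp.pca(adata)
    sc.pp.neighbors(adata)

    # Embedding, clustering, and visualization
    sc.tl.tsne(adata)
    sc.tl.leiden(adata)
    sc.pl.tsne(adata, color="leiden")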

Update `generate_polygon` function parameters

Description

Currently, our generate_polygon function signature is:

generate_polygon(coordinates, is_expressing, theta_bound=[0, 2 * np.pi])

For improved functionality and user flexibility, we should update the function to support the following parameters:

generate_polygon(dge_file, marker_gene=None, target_cluster=None, theta_bound=...)

Proposed Changes

  1. Replace coordinates and is_expressing parameters with dge_file, marker_gene, and target_cluster.
  2. Update function logic: Adjust the internal logic of the generate_polygon function to work with the new parameters and ensure it performs as expected.
  3. Test the function: After making the changes, thoroughly test the function to ensure no regression.

Benefits

  • By using a dge_file, we can directly input the file and extract the necessary information, making the function more versatile.
  • The optional parameters marker_gene and target_cluster will give users more targeted outputs, allowing them to focus on specific genes or clusters.

Additional Notes

Please ensure backward compatibility or provide a migration guide if backward compatibility is not maintained.
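
A sketch of what the updated signature could look like, with a thin backward-compatible path; the DGE loading details (CSV layout, x/y and cluster column names) are assumptions, not the repo's implementation:

    import numpy as np
    import pandas as pd

    def generate_polygon(dge_file=None, marker_gene=None, target_cluster=None,
                         theta_bound=[0, 2 * np.pi],
                         coordinates=None, is_expressing=None):
        """Build an RSP polygon from a DGE file, or from precomputed
        coordinates/is_expressing arrays for backward compatibility."""
        if dge_file is not None:
            dge = pd.read_csv(dge_file)                    # assumed CSV layout
            if target_cluster is not None:
                dge = dge[dge["cluster"] == target_cluster]
            coordinates = dge[["x", "y"]].to_numpy()       # assumed column names
            is_expressing = (dge[marker_gene] > 0).to_numpy() if marker_gene else None
        elif coordinates is None or is_expressing is None:
            raise ValueError("provide either dge_file or coordinates and is_expressing")

        # ... existing polygon construction over theta_bound goes here,
        # operating on coordinates and is_expressing as before ...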
