
rsp-old's People

Contributors

cytronicoder, redocinortyc

rsp-old's Issues

Redundant 360° Scanning in RSP Implementation

There's a redundancy issue in our Radar Scanning Plot (RSP) implementation where the scan erroneously extends to a full 360°. This causes an unnecessary scan of the starting position, effectively duplicating the data point at 0° and 360°, which could lead to misinterpretation of the plot or skewing of any statistical analysis.
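
One common way to avoid the duplicate endpoint is to generate the scan angles with the upper bound excluded. A minimal sketch with NumPy (illustrative only; the actual angle generation in the repo may differ):

    import numpy as np

    n_steps = 360  # one scan step per degree
    # endpoint=False stops one step short of 2*pi, so the starting
    # position at 0 degrees is not scanned again at 360 degrees.
    theta = np.linspace(0, 2 * np.pi, n_steps, endpoint=False)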

Build RSP analysis app using Dash

Self-explanatory issue: both the simulation app and the main app should work smoothly.

  • Make sure that proper documentation is provided on how to use the apps.
  • PR should only be accepted after #12 is closed.
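
A minimal Dash skeleton to start from (a sketch assuming Dash 2.x; the layout and the placeholder polar figure are illustrative, not the planned app):

    import dash
    from dash import dcc, html
    import plotly.graph_objects as go

    app = dash.Dash(__name__)

    # Placeholder polar trace standing in for an RSP plot
    fig = go.Figure(go.Scatterpolar(r=[1, 2, 1.5], theta=[0, 120, 240], mode="lines"))

    app.layout = html.Div([
        html.H1("RSP Analysis"),
        dcc.Graph(id="rsp-plot", figure=fig),
    ])

    if __name__ == "__main__":
        app.run_server(debug=True)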

Documentation Needed using Sphinx

  1. Set up Sphinx in the repo.
  2. Document every function with descriptions, parameters, return values, and exceptions.
  3. Provide an overview and detailed descriptions for each file.
  4. Configure Sphinx to auto-generate documentation from docstrings.
  5. (Optional) Integrate the documentation with Read the Docs for hosting.
  6. Review all documentation for consistency and clarity.
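
A minimal autodoc-oriented configuration sketch (the paths, project name, and theme below are assumptions about the repo layout):

    # docs/conf.py
    import os
    import sys

    sys.path.insert(0, os.path.abspath(".."))  # make the package importable for autodoc

    project = "rsp"  # placeholder project name
    extensions = [
        "sphinx.ext.autodoc",   # pull documentation from docstrings
        "sphinx.ext.napoleon",  # support Google/NumPy style docstrings
        "sphinx.ext.viewcode",  # link rendered docs to highlighted source
    ]
    html_theme = "sphinx_rtd_theme"  # convenient if hosting on Read the Docs (step 5)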

Verify correctness of generated `master_file.csv`

Description

We've generated a CSV file named master_file.csv, which contains gene-related data, including gene name, coverage, mean expression, total expression, and RSP area. Before proceeding with further analyses or sharing the dataset, we must ensure this CSV is correct and contains accurate data.

Steps to reproduce

  1. Clone the repository and navigate to the relevant directory
  2. Check the file master_file.csv
  3. Use any relevant scripts or methods to regenerate the CSV
  4. Compare the regenerated CSV with the provided master_file.csv

Expected behavior

The CSV should accurately represent the gene data as per our analysis scripts. All entries should match expectations, and there should be no missing or erroneous values.

Tasks

  • Review the structure and headers of the CSV to ensure they match expectations.
  • Check for any missing or NaN values.
  • Compare the new CSV with the previous one generated with the old program.
  • Verify that expression and area values are within expected ranges.
  • Confirm that coverage values lie between 0% and 100%.
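
Most of these tasks can be covered by a quick pandas pass; a sketch (the exact column names below are assumptions about the file's headers):

    import pandas as pd

    df = pd.read_csv("master_file.csv")

    # Structure and headers
    expected = {"gene", "coverage", "mean_expression", "total_expression", "rsp_area"}
    missing = expected - set(df.columns)
    assert not missing, f"missing columns: {missing}"

    # Missing / NaN values
    print(df.isna().sum())

    # Coverage should lie between 0% and 100%
    assert df["coverage"].between(0, 100).all(), "coverage outside [0, 100]"

    # Expression and area values should be finite and non-negative
    for col in ("mean_expression", "total_expression", "rsp_area"):
        assert df[col].notna().all() and (df[col] >= 0).all(), f"unexpected values in {col}"

For the comparison against the file produced by the old program, pandas.DataFrame.compare (or an outer merge on the gene column) can be used to list the differing rows.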

Simulated data

In the first simulation, points were generated by rejection sampling: each candidate point was drawn with coordinates between -1 and 1 using uniform sampling and accepted only if it satisfied the condition x^2 + y^2 ≤ 1. This process was repeated to generate the cluster points. For each simulation, a certain percentage of the points (ranging from 5% to 95%) was randomly chosen and assumed to represent the cells expressing the gene.

Implement a function that generates a specified number of points within the unit circle. A given percentage of these points should be marked as expressing cells, and the distribution of the expressing cells should be either uniform/random (even) or clustered together (biased).
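
A minimal sketch of such a generator (the function name, signature, and the nearest-to-a-seed clustering heuristic are assumptions, not the repo's implementation):

    import numpy as np

    def simulate_cells(n_points, expressing_fraction, biased=False, seed=None):
        """Generate points in the unit circle and mark a fraction as expressing cells."""
        rng = np.random.default_rng(seed)

        # Rejection sampling: draw from [-1, 1]^2 and keep points inside the unit circle
        points = []
        while len(points) < n_points:
            x, y = rng.uniform(-1, 1, size=2)
            if x * x + y * y <= 1:
                points.append((x, y))
        points = np.array(points)

        n_expressing = int(round(expressing_fraction * n_points))
        if biased:
            # Clustered (biased): mark the points nearest to a randomly chosen center
            center = points[rng.integers(n_points)]
            order = np.argsort(np.linalg.norm(points - center, axis=1))
            chosen = order[:n_expressing]
        else:
            # Even (uniform/random): mark a uniformly random subset
            chosen = rng.choice(n_points, size=n_expressing, replace=False)

        is_expressing = np.zeros(n_points, dtype=bool)
        is_expressing[chosen] = True
        return points, is_expressing

For example, simulate_cells(1000, 0.25, biased=True) would mark roughly 25% of the points as a clustered expressing population.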

Support for analyzing additional datasets

The RSP tool has shown promising results with the current neonatal mouse heart dataset. Analyzing additional datasets would further enhance its capabilities and applicability.

Implementation considerations

  • Data Preprocessing: Each dataset might come with its own set of challenges in terms of preprocessing. It would be essential to have a flexible preprocessing pipeline.
  • Scalability: Some of the new datasets might be significantly larger. Ensuring that the RSP tool scales well with larger datasets would be crucial.
  • Documentation: For each new dataset supported, comprehensive documentation detailing the preprocessing steps, potential use cases, and any dataset-specific nuances would benefit users.

Pinned issue for TODOs

Here's an ever-updating list of TODOs I have to look into following my research meeting:

  • Visualize the relationship between the coverage and the RSP plot
  • #19
  • Look at the “Others” category in the PGC rank pie chart and try to classify those genes
  • #20
    • Get percentages of studies per category
  • What are the results?
    • Conclusion 1: RSP plots are more sensitive to coverage (compared to PGC, for example) and can be used to discover shape.
    • Conclusion 2: Look at a specific cluster (indicate cluster type and marker genes) and conduct a comparison analysis between the two methods (PGC and RSP) + ranking analysis
    • Conclusion 3 - Novel: Low-coverage genes with some bias (non-zero threshold for gene expression?) and their respective RSP plots
      • Find gene pairs that qualify as having the same or different shapes - cardiomyocyte-related (different stages)
    • Conclusion 4: Get t-SNE and manually place the center of the RSP plot to compare the genes - scan all of these clusters together
  • Email session chair about my status as a high school scholar working with Professor Chen - wish to speak + network at the Young Scientist Session

Run PAGER, classify by GOA and WikiPathways, and feed into ChatGPT to summarise

To enhance the analysis and summarization capabilities of our gene analysis pipeline, I propose integrating PAGER, classifying its results using Gene Ontology Annotation (GOA) and WikiPathways, and summarizing them with ChatGPT. This feature will help us understand the gene sets better and provide succinct summaries for end users.

Proposed Workflow

  1. PAGER Execution
    • Run PAGER on the identified gene sets to analyze genome-wide gene set relationships.
  2. Classification
    • Classify the PAGER results using Gene Ontology Annotation (GOA) and WikiPathways to categorize the gene sets into biologically meaningful groups.
  3. Summarization with ChatGPT
    • Feed the classified gene sets and their relationships into ChatGPT to generate a user-friendly summary elucidating the significant findings.
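
A sketch of step 3 only, using the openai Python client (the model name, prompt, and function name are placeholders; the PAGER run and the GOA/WikiPathways classification are assumed to have already produced a text or table of classified gene sets):

    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    def summarize_gene_sets(classified_results: str) -> str:
        """Ask ChatGPT for a short, user-friendly summary of classified gene sets."""
        response = client.chat.completions.create(
            model="gpt-4o-mini",  # placeholder model name
            messages=[
                {"role": "system",
                 "content": "You summarize gene set enrichment results for biologists."},
                {"role": "user",
                 "content": f"Summarize the significant findings:\n{classified_results}"},
            ],
        )
        return response.choices[0].message.content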

Work out a CSV with simulated data (different ranges of coverages)

To better test and understand the performance and behavior of our analysis scripts, it would be beneficial to have a CSV file containing simulated data with different ranges of coverage. The file should have a structured format that mimics real data but uses controlled, known values to cover a variety of scenarios we might encounter.
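
One way to lay out such a file, reusing the structure of master_file.csv (the column names and coverage grid are assumptions, and the numeric values here are random placeholders to be replaced by the pipeline's real outputs):

    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(0)
    rows = []
    for coverage in range(5, 100, 5):                 # 5% to 95% in 5% steps
        for distribution in ("even", "biased"):
            rows.append({
                "gene": f"sim_{distribution}_{coverage:02d}",
                "coverage": coverage,
                "distribution": distribution,
                # Placeholder statistics; in practice these come from the RSP pipeline
                "mean_expression": rng.uniform(0, 10),
                "total_expression": rng.uniform(0, 1000),
                "rsp_area": rng.uniform(0, np.pi),
            })

    pd.DataFrame(rows).to_csv("simulated_master_file.csv", index=False)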

Marker gene highlighting and cluster filtering

The current t-SNE plotting function allows a basic visualization of the data clustered with DBSCAN. However, it lacks the ability to highlight specific genes and filter clusters, which can be crucial for more targeted analysis.

Highlight Marker Gene(s)

  • Add the ability to highlight one or more marker genes in the t-SNE plot.
  • Users should be able to specify the color(s) for the marker gene(s) (e.g., red, blue, green).
  • Marker genes should be prominent against the background genes, which should be displayed in a muted color (e.g., grey).

Display Specific Cluster(s)

  • Allow users to specify one or more clusters to be displayed on the t-SNE plot.
  • Other clusters should either be hidden or displayed in a muted color for context.
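
A matplotlib sketch of both features (the function name, argument names, and data layout - a coordinates table with tsne_1, tsne_2, and cluster columns plus a cells-by-genes expression table - are assumptions):

    import matplotlib.pyplot as plt

    def plot_tsne(coords, expression, marker_genes=None, marker_colors=None, clusters=None):
        """t-SNE scatter with optional marker-gene highlighting and cluster filtering.

        coords: DataFrame with tsne_1, tsne_2, and cluster columns (one row per cell).
        expression: DataFrame of expression values, same rows, one column per gene.
        """
        fig, ax = plt.subplots()

        # Cluster filtering: show only the requested clusters, if any
        shown = coords if clusters is None else coords[coords["cluster"].isin(clusters)]

        # Background cells in a muted color
        ax.scatter(shown["tsne_1"], shown["tsne_2"], s=5, c="lightgrey", label="other cells")

        # Highlight cells expressing each requested marker gene
        for i, gene in enumerate(marker_genes or []):
            expressing = shown[expression.loc[shown.index, gene] > 0]
            color = marker_colors[i] if marker_colors and i < len(marker_colors) else None
            ax.scatter(expressing["tsne_1"], expressing["tsne_2"], s=15, c=color, label=gene)

        ax.legend()
        return fig, ax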

Migrate codebase to utilize `scanpy` and `BioPython`

Background

Our current pipeline utilizes a mix of custom implementations and various libraries. I'm considering migrating some parts of the code to specialized libraries such as scanpy for single-cell analysis and BioPython for bioinformatics operations.

Action Items

  1. Code Audit: Identify areas in our codebase that can benefit from scanpy and BioPython.
  2. Refactor & Migrate: Begin the process of refactoring the identified code areas. This would include:
    • Using scanpy for t-SNE generation, clustering, and visualization.
    • Leveraging BioPython for sequence analysis, file format conversions, and other bioinformatics tasks.
  3. Testing: Rigorous testing of the migrated code to ensure accuracy and efficiency.
  4. Documentation: Update documentation to reflect the changes and provide guidelines for using the new libraries.
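
For item 2 above, the t-SNE/clustering portion might reduce to a handful of scanpy calls (a sketch; the input file name is a placeholder, and Leiden clustering is shown in place of the current DBSCAN step):

    import scanpy as sc

    # Load the expression matrix (cells x genes) into an AnnData object
    adata = sc.read_csv("dge_matrix.csv")  # placeholder file name

    # Standard preprocessing
    sc.pp.normalize_total(adata, target_sum=1e4)
    sc.pp.log1p(adata)
    sc.pp.pca(adata)
    sc.pp.neighbors(adata)

    # Embedding, clustering, and visualization
    sc.tl.tsne(adata)
    sc.tl.leiden(adata)
    sc.pl.tsne(adata, color="leiden")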

Update `generate_polygon` function parameters

Description

Currently, our generate_polygon function signature is:

generate_polygon(coordinates, is_expressing, theta_bound=[0, 2 * np.pi])

For improved functionality and user flexibility, we should update the function to support the following parameters:

generate_polygon(dge_file, marker_gene=None, target_cluster=None, theta_bound=...)

Proposed Changes

  1. Replace coordinates and is_expressing parameters with dge_file, marker_gene, and target_cluster.
  2. Update function logic: Adjust the internal logic of the generate_polygon function to work with the new parameters and ensure it performs as expected.
  3. Test the function: After making the changes, thoroughly test the function to ensure no regression.

Benefits

  • By using a dge_file, we can directly input the file and extract the necessary information, making the function more versatile.
  • The optional parameters marker_gene and target_cluster will give users more targeted outputs, allowing them to focus on specific genes or clusters.

Additional Notes

Please ensure backward compatibility or provide a migration guide if backward compatibility is not maintained.
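
A sketch of what the updated signature could look like, with a thin backward-compatible path; the DGE loading details (CSV layout, x/y and cluster column names) are assumptions, not the repo's implementation:

    import numpy as np
    import pandas as pd

    def generate_polygon(dge_file=None, marker_gene=None, target_cluster=None,
                         theta_bound=[0, 2 * np.pi],
                         coordinates=None, is_expressing=None):
        """Build an RSP polygon from a DGE file, or from precomputed
        coordinates/is_expressing arrays for backward compatibility."""
        if dge_file is not None:
            dge = pd.read_csv(dge_file)                    # assumed CSV layout
            if target_cluster is not None:
                dge = dge[dge["cluster"] == target_cluster]
            coordinates = dge[["x", "y"]].to_numpy()       # assumed column names
            is_expressing = (dge[marker_gene] > 0).to_numpy() if marker_gene else None
        elif coordinates is None or is_expressing is None:
            raise ValueError("provide either dge_file or coordinates and is_expressing")

        # ... existing polygon construction over theta_bound goes here,
        # operating on coordinates and is_expressing as before ...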
