
pepembed's Introduction

Pepkit

PEP compatible

Install

pip install --user --upgrade https://github.com/pepkit/pepkit/archive/master.zip

Install dev version

pip install --user --upgrade https://github.com/pepkit/pepkit/archive/dev.zip

pepembed's People

Contributors

donaldcampbelljr, nleroy917


pepembed's Issues

Update mining procedure

Overview

We need to update the function that mines a description from PEPs inside pephub for subsequent embedding and insertion into Qdrant. Currently, the pipeline looks inside each PEP and attempts to extract any project-level attributes. These are typically attributes that describe the data from a global perspective. See here for more info on the architecture. It would be better if we extracted all meaningful text from the PEP and used that to compute an embedding for vector search and retrieval. Example:

Bad:

>>> desc = mine_description(pep)
>>> desc
"Asthma study"

Good:

>>> desc = mine_description(pep)
>>> desc
"More than 200 asthma-associated genetic variants have been identified
in genome-wide association studies (GWASs). Expression quantitative trait loci"

Design Goals

There are two main design goals for the updated pepembed description mining functionality:

  1. It should extract large, information-rich descriptions from PEPs that include a lot of information (e.g. GEO and bedbase PEPs).
  2. It should work exactly the same way for big PEPs (GEO, SRA, bedbase) as for small PEPs (normal user PEPs).

Basically, we want one function that operates for any PEP you give it, and it should be capable of accessing rich biological information at any level.

Technical details and code

Pseudo-code of the current implementation looks like this:

for attr in project_dict:
    if any(key_word in attr for key_word in self.keywords):
        desc += project_dict[attr] + " "
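Filled out as a self-contained, runnable function, the current approach looks roughly like this (the `KEYWORDS` list and the example PEP dict are illustrative placeholders, not the repo's actual values):

```python
# Minimal sketch of the current keyword-based mining.
# KEYWORDS and the example dict below are illustrative, not from the repo.
KEYWORDS = ["description", "summary", "title"]

def mine_description(project_dict: dict) -> str:
    """Concatenate values whose attribute name contains a keyword."""
    desc = ""
    for attr in project_dict:
        if any(key_word in attr for key_word in KEYWORDS):
            desc += str(project_dict[attr]) + " "
    return desc.strip()

pep = {
    "name": "GSE226825",
    "pep_version": "2.1.0",
    "description": "Asthma study",
}
print(mine_description(pep))  # -> Asthma study
```

Note this only sees top-level attribute names, which is exactly the limitation described below: rich text nested under `sample_modifiers` or `experiment_metadata` never matches.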

We need this to be more flexible and more intelligent. For example, if we have a project yaml/dict that looks like:

name: GSE226825
pep_version: 2.1.0
sample_table: GSE226825_PEP_raw.csv
sample_modifiers:
  append:
    sample_data_processing: Adapter sequences were trimmed by Trimmomatic (v0.39).
      Trimmed reads aligned using HISAT2 (v2.2.0) with referring hg19 genome. Aligned
      reads are sorted by samtools (v1.9)...
    sample_extract_protocol_ch1: "RNA was extracted from ...
experiment_metadata:
  series_type: Expression profiling by high throughput sequencing
  series_title: RNA sequencing of peripheral blood mononuclear cells isolated from
    Korean patients with asthma
  series_status: Public on Mar 08 2023
  series_summary: More than 200 asthma-associated genetic variants have been identified
    in genome-wide association studies (GWASs). Expression quantitative trait loci
    (eQTL) ...

We would need to extract fields like sample_data_processing: and series_summary:, since these contain so much information about the data.
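One possible direction is to walk the whole PEP dict recursively and keep any sufficiently long free-text value, rather than matching top-level attribute names. This is a hypothetical sketch, not the repo's implementation; the `MIN_CHARS` threshold is an assumed heuristic:

```python
# Hypothetical recursive miner: collect every long string anywhere in the
# PEP dict, at any nesting depth. MIN_CHARS is an assumed heuristic
# (long strings tend to be rich descriptions, short ones identifiers).
MIN_CHARS = 50

def mine_description(obj, collected=None) -> str:
    if collected is None:
        collected = []
    if isinstance(obj, dict):
        for value in obj.values():
            mine_description(value, collected)
    elif isinstance(obj, list):
        for item in obj:
            mine_description(item, collected)
    elif isinstance(obj, str) and len(obj) >= MIN_CHARS:
        collected.append(obj)
    return " ".join(collected)
```

Applied to the GSE226825 example above, this would pick up both sample_data_processing and series_summary regardless of nesting depth, while skipping short fields like name and pep_version — which satisfies both design goals with a single function.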

This is the exact spot in the code that mines the description. This is where the magic happens! Nearly everything else in this repo can be thought of as convenient "glue" that keeps the pipeline going and consistent. The mine_metadata_from_dict function is the only one that needs significant changes (at least for now...)

Secrets, debugging, and testing

There are three things you will need for efficient development and testing.

The first is lab secrets. We are working with two databases in this package, so there are a handful of secrets and passwords we use to connect to them. This repo is set up to be compatible with the lab secret workflow. If you are set up properly with that workflow, you can simply run source production.env and your environment will be populated with the correct credentials. Ask @nleroy917 or @nsheff if you need help here...

The second is debugging. I also have this repository set up to work with VSCode debugging. Hitting F5 launches the debugger, and you can then use breakpoints to stop the code and inspect things.

The third is testing. There is a tests/ directory, but it doesn't contain anything yet 😅. The best way to test currently is to install the package locally (pip install .) and then run the CLI: pepembed. You can speed things up by limiting the results from the database: pepembed -n 100.

Extras

In addition, we discussed in a meeting that we should store multiple vectors for each object inside the Qdrant collection. Here is a blog post that explains how to do just that with Qdrant.
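For orientation, Qdrant supports this via named vectors: the collection is created with several vector slots, and each point supplies one vector per name. The sketch below uses plain dicts in the shape of Qdrant's create-collection and point payloads; the vector names, sizes, and payload fields are illustrative assumptions, not decisions we've made:

```python
# Sketch of a named-vectors setup in the shape of Qdrant's REST payloads.
# Vector names, sizes, and payload fields are illustrative assumptions.
vectors_config = {
    "description": {"size": 384, "distance": "Cosine"},   # mined free text
    "sample_table": {"size": 384, "distance": "Cosine"},  # sample metadata
}

# A point then carries one vector per configured name
# (values truncated to placeholders for brevity):
point = {
    "id": 1,
    "vector": {
        "description": [0.1] * 384,
        "sample_table": [0.2] * 384,
    },
    "payload": {"registry_path": "example/GSE226825:default"},
}
```

Search requests would then specify which named vector to query against, so description-text search and sample-metadata search can coexist in one collection.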
