pip install --user --upgrade https://github.com/pepkit/pepkit/archive/master.zip
pip install --user --upgrade https://github.com/pepkit/pepkit/archive/dev.zip
Indexer of pephub database
We need to update the function that mines a description from PEPs inside pephub for subsequent embedding and insertion into Qdrant. Currently, the pipeline looks inside each PEP and attempts to extract any project-level attributes. These are typically attributes that describe the data from a global perspective. See here for more info on the architecture. It would be better if we extracted all meaningful text from the PEP and used that to compute an embedding for vector search and retrieval. Example:
Bad:
>>> desc = mine_description(pep)
>>> desc
"Asthma study"
Good:
>>> desc = mine_description(pep)
>>> desc
"More than 200 asthma-associated genetic variants have been identified in genome-wide association studies (GWASs). Expression quantitative trait loci"
There are two main design goals for the updated pepembed description mining functionality:
Basically, we want one function that operates for any PEP you give it, and it should be capable of accessing rich biological information at any level.
Pseudo-code of the current implementation looks like this:
desc = ""
for attr in project_dict:
    # keep only project-level attributes whose name contains a keyword
    if any(key_word in attr for key_word in self.keywords):
        desc += project_dict[attr] + " "
We need this to be more flexible and more intelligent. For example, if we have a project YAML/dict that looks like:
name: GSE226825
pep_version: 2.1.0
sample_table: GSE226825_PEP_raw.csv
sample_modifiers:
  append:
    sample_data_processing: Adapter sequences were trimmed by Trimmomatic (v0.39).
      Trimmed reads aligned using HISAT2 (v2.2.0) with referring hg19 genome. Aligned
      reads are sorted by samtools (v1.9)...
    sample_extract_protocol_ch1: "RNA was extracted from ...
experiment_metadata:
  series_type: Expression profiling by high throughput sequencing
  series_title: RNA sequencing of peripheral blood mononuclear cells isolated from
    Korean patients with asthma
  series_status: Public on Mar 08 2023
  series_summary: More than 200 asthma-associated genetic variants have been identified
    in genome-wide association studies (GWASs). Expression quantitative trait loci
    (eQTL) ...
We would need to extract fields such as sample_data_processing and series_summary, since these contain the richest information about the data.
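One way to reach text at any nesting level is to walk the project dict recursively and keep only the longer string values. The following is a minimal sketch, not the pepembed implementation: the names iter_text and mine_text and the MIN_WORDS threshold are hypothetical, and the word-count filter is just one assumed heuristic for separating descriptive text from identifiers and file names.

```python
from typing import Any, Iterator

# Hypothetical threshold: strings with fewer words than this are treated as
# identifiers or short labels rather than descriptive text.
MIN_WORDS = 6


def iter_text(node: Any) -> Iterator[str]:
    """Yield every string value found anywhere in a nested dict/list."""
    if isinstance(node, str):
        yield node
    elif isinstance(node, dict):
        for value in node.values():
            yield from iter_text(value)
    elif isinstance(node, (list, tuple)):
        for item in node:
            yield from iter_text(item)
    # Numbers, booleans and None carry no descriptive text, so they are skipped.


def mine_text(project: dict, min_words: int = MIN_WORDS) -> str:
    """Join all sufficiently long, non-duplicate string values into one description."""
    chunks = []
    for text in iter_text(project):
        cleaned = " ".join(text.split())  # collapse newlines and repeated spaces
        if len(cleaned.split()) >= min_words and cleaned not in chunks:
            chunks.append(cleaned)
    return " ".join(chunks)


# Abbreviated version of the example project above (values shortened).
project = {
    "name": "GSE226825",
    "pep_version": "2.1.0",
    "sample_table": "GSE226825_PEP_raw.csv",
    "sample_modifiers": {
        "append": {
            "sample_data_processing": "Adapter sequences were trimmed by Trimmomatic (v0.39).",
        }
    },
    "experiment_metadata": {
        "series_type": "Expression profiling by high throughput sequencing",
        "series_status": "Public on Mar 08 2023",
        "series_summary": "More than 200 asthma-associated genetic variants have been identified",
    },
}

description = mine_text(project)
print(description)
```

With this heuristic, short values such as the project name, the sample table path and series_status are dropped, while the processing notes and the series summary are kept regardless of how deeply they are nested. A real implementation would probably need a smarter filter than word count.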
This is the exact spot in the code that is mining the description. This is where the magic is happening! Nearly all else in this repo can be thought of as a convenient "glue" that keeps the pipeline going and consistent. The mine_metadata_from_dict function is the only one that needs significant changes (at least for now...)
There are three things that you will need for efficient development and testing.
The first is lab secrets. We are working with two databases in this package, so there are a handful of secrets and passwords we use to connect to them. This repo is set up to be compatible with the lab secret workflow. If you are set up properly with the lab secret workflow, you can simply run source production.env and your environment will be populated with the correct credentials. Ask @nleroy917 or @nsheff if you need help here...
The second is debugging. I also have this repository set up to work with VSCode debugging. By hitting F5, you can launch the debugger and then use breakpoints to stop the code and inspect things.
The third is testing. I have a tests/ directory, but it doesn't contain anything. The best way to test currently is by installing the package locally with pip install (pip install .), and then just running the CLI: pepembed. You can speed things up by limiting the results from the database: pepembed -n 100.
In addition, we discussed in a meeting that we should have multiple vectors for each object that is stored inside the Qdrant collection. Here is a blog post that explains how to do just that with Qdrant.