manulera / genestorian_data_refinement

A project to extract genotype information from lab spreadsheets.

License: MIT License

JavaScript 1.76% Shell 1.38% Jupyter Notebook 28.27% Python 65.30% Dockerfile 0.33% HTML 2.98%

genestorian_data_refinement's People

Contributors: anamika-yadav99, manulera

genestorian_data_refinement's Issues

Third version of the pipeline

I have moved these to #22, as I think they belong there:

  • Some minor changes:
    • In the excel_to_tsv function, use lowercase for column names strain_id, genotype.
    • Change format.py in dey_lab so that it uses the new excel_to_tsv function.
    • Modify the code so that it uses the new column names in strains.tsv
    • Modify test_second_version accordingly, and re-run the tests to see that they still pass.
  • In alleles.json:
    • Add an extra field to each allele_feature in addition to name and feature_type named coords. This should contain the coordinates of the match in the allele string. For example:
{
      "name": "nup61-mneongreen:hph",
      "pattern": "GENE-TAG:MARKER",
      "allele_features": [
         {
            "name": "nup61",
            "feature_type": "GENE",
            "coords": [0,5]
         },
         {
            "name": "mneongreen",
            "feature_type": "TAG",
            "coords": [6,16]
         },
         {
            "name": "hph",
            "feature_type": "MARKER",
            "coords": [17,20]
         }
      ]
   }
  • See that for this allele, name[0:5] == 'nup61', name[6:16] == 'mneongreen', etc.
  • Sort the allele_features by their order of appearance using the value in coords. See https://stackoverflow.com/questions/11850425/custom-python-list-sorting.
  • Write a new file test_third_version that
    • Tests that for all coords in allele_features, the substring produced using the coordinates matches the allele_feature name, such as name[6:16] == 'mneongreen'
    • That allele coordinates are sorted according to the first value in coords.
  • Some extra code to help us refine the pipeline:
    • Write a function that finds the strings that are not part of allele features, and writes a tsv file where the first column is the string, the second column is the number of times that string appears, and the lines are sorted by the number in the second column:
      • For example, if there were only two alleles nup61-mneongreen:hph and ase1-mCh-hph, the file should look like this:
-	3
:	1
  • To implement this, you can see how the Counter class (from the collections module) is used in notebooks/second_notebook
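A minimal sketch of how this could look, using Counter. The function name and the idea of recovering the unmatched substrings from the coords field are my assumptions, not the repo's implementation:

```python
from collections import Counter

def count_unmatched_strings(alleles):
    """Count the substrings of each allele name that were not matched
    as allele features (hypothetical helper). The unmatched parts are
    recovered by cutting the matched coords out of the allele name."""
    counter = Counter()
    for allele in alleles:
        name = allele['name']
        # Sort features by order of appearance using the value in coords
        allele['allele_features'].sort(key=lambda f: f['coords'][0])
        pos = 0
        for feature in allele['allele_features']:
            start, end = feature['coords']
            if start > pos:
                counter[name[pos:start]] += 1
            pos = end
        if pos < len(name):
            counter[name[pos:]] += 1
    return counter

alleles = [{'name': 'nup61-mneongreen:hph',
            'allele_features': [
                {'name': 'nup61', 'feature_type': 'GENE', 'coords': [0, 5]},
                {'name': 'mneongreen', 'feature_type': 'TAG', 'coords': [6, 16]},
                {'name': 'hph', 'feature_type': 'MARKER', 'coords': [17, 20]}]}]

counts = count_unmatched_strings(alleles)
# Write the tsv sorted by count, most frequent first
with open('unmatched_strings.tsv', 'w') as f:
    for string, n in counts.most_common():
        f.write(f'{string}\t{n}\n')
```

Counter.most_common() already returns the entries sorted by count, which gives the required line order for free.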

Revise dependencies

I had to make some fixes to make it work:

  • Install nltk manually with pip
-from genestorian_module.build_nltk_tags import build_nltk_tag
-from genestorian_module.build_nltk_trees import apply_pseudo_grammar, post_process_pseudo_grammar
+from genestorian_module.genestorian_module.build_nltk_tags import build_nltk_tag
+from genestorian_module.genestorian_module.build_nltk_trees import apply_pseudo_grammar, post_process_pseudo_grammar

Throughout all files

Substitution of patterns should take into account the length of the target sequence

This is something that I mentioned in the comments in the code, but not sure if it was clearly formulated. Let's imagine we want to find the marker NatMx6, that can appear in different ways: NatMx, NatR or simply Nat.

Clearly, when we substitute, we should try the different strings from the longest one (NatMx6) to the shortest one (Nat). The problem comes when we use regular expressions: the regex may be long, but the pattern it matches shorter, so simply sorting the list of terms by length would not work. I guess there is a way to store all possible matches and always keep the ones that match the longest part of the string? Something to keep in mind.

This does not apply to the example from the word document of nat, where we only want to replace nat when it is surrounded by non-letter, non-number characters, because there it would not match on the longest strings anyway.
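One way the "keep the longest match" idea could be sketched (this is just an illustration of the approach, not the repo's implementation): collect every candidate match from every pattern, then greedily keep the longest non-overlapping ones.

```python
import re

patterns = ['natmx6', 'natmx', 'natr', 'nat']

def longest_matches(text, patterns):
    """Collect all matches of all patterns, then keep the longest
    non-overlapping ones (longest first, then left to right)."""
    matches = []
    for p in patterns:
        for m in re.finditer(p, text):
            matches.append((m.start(), m.end()))
    # Sort by match length (descending), breaking ties left to right
    matches.sort(key=lambda m: (-(m[1] - m[0]), m[0]))
    kept = []
    for start, end in matches:
        if all(end <= s or start >= e for s, e in kept):
            kept.append((start, end))
    return sorted(kept)

print(longest_matches('ase1d:natmx', patterns))  # → [(6, 11)]
```

Here 'nat' also matches at position 6, but it overlaps the longer 'natmx' match and is discarded.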

Retrieving fluorescent protein data from a public API

Hi Anamika,

I think we should retrieve most of the fluorescent protein data from www.fpbase.org. They have quite an extensive list, and they also have aliases of proteins. For this, you should write a script in get_data, called get_fpbase_data.py or something like that. This script should make a request to their GraphQL API, get the full list of fluorescent proteins and their aliases, and convert them to a toml file in the same format as allele_components/tags.toml, but stored in data/tags.toml. We should keep allele_components/tags.toml anyway, as there are some tags that are not in this database (the non-fluorescent ones). There is currently no such example in the file, so you can delete all the examples listed and add these two:

[tag.myc]
name = 'myc'
ref = ''

[tag.flag]
name = 'flag'
ref = ''

The very short documentation for the api entrypoint is here: https://help.fpbase.org/api/introduction#graphql

In this link you can test your queries interactively before writing code: https://www.fpbase.org/graphql/

I let you figure out the query for yourself, but the data conversion from API to toml should be:

| API | toml |
| --- | --- |
| proteins.name | name |
| proteins.aliases | synonyms |
| proteins.primaryReference.doi | reference |

In summary:

  • Write the script to save the data in data/tags.toml
  • Add it to gitignore (it should not be committed; instead it should be generated before the analysis by calling the script)
    • Document this in #30
  • Delete redundant tags in allele_components/tags.toml and add the new ones
  • Use both tags.toml in the analysis (at some point we will merge them, but let's leave that for now)
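To illustrate the conversion step only (the GraphQL query itself is left to you, as the issue says), here is a sketch that turns a list of protein records into the toml format. The function name and the record layout are my assumptions about what the API returns, and I use the ref field name from the tags.toml example above rather than the reference name in the mapping table:

```python
def fpbase_to_toml(proteins):
    """Convert a list of protein records (assumed shape: dicts with
    name, aliases and primaryReference.doi) to the tags.toml format."""
    lines = []
    for p in proteins:
        key = p['name'].lower()
        # Quote the key: some protein names contain characters that
        # toml bare keys do not allow
        lines.append(f'[tag."{key}"]')
        lines.append(f"name = '{key}'")
        if p.get('aliases'):
            syns = ', '.join(f"'{a.lower()}'" for a in p['aliases'])
            lines.append(f'synonyms = [{syns}]')
        ref = (p.get('primaryReference') or {}).get('doi') or ''
        lines.append(f"ref = '{ref}'")
        lines.append('')
    return '\n'.join(lines)

# Example record with placeholder values
example = [{'name': 'mCherry', 'aliases': ['mCh'],
            'primaryReference': {'doi': ''}}]
print(fpbase_to_toml(example))
```

The output of get_fpbase_data.py would then simply be this string written to data/tags.toml.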

Create your own blog

This can wait until we figure out the first version of the code, to work with the Dey strains.

I would recommend hugo: https://gohugo.io/

Genestorian's site is built using hugo: https://github.com/manulera/GenestorianLandingPageSource. You could use this repository as a template for the github actions setup.

  • create your own website for the blog
  • publish blog 1 (due 26th June)
  • publish blog 2 (due 10th July)
  • publish blog 3 (due 24th July)
  • publish blog 4 (due 7th Aug)
  • publish blog 5 (due 21st Aug)
  • publish blog 6 (due 4th Sept)
  • publish final blog (due 15th Sept)

Second version of the pipeline

In this new version, we should aim to produce two json files:

strains.json contains a list of objects representing strains (one for each row in strains.tsv).

For a line in strains.tsv like this:

| Sample Name | Genotype |
| --- | --- |
| ID123 | h- cls1-36 ase1D:NatMx |
[
    {
        id: 'ID123', // This should be a unique identifier for the strain in the file, the `sample name` field in the dey lab.
        genotype: 'h- cls1-36 ase1D:NatMx', // The genotype as listed in the tsv file
        mating_type: 'h-', // This can only be h- / h+ / h90
        alleles: [
            'cls1-36',
            'ase1D:NatMx'
        ]
    }
]

An alleles.json file that contains a list of found alleles, and what you have substituted in them. There should be only one entry per allele name: if the same allele name is present twice (e.g. if two strains have ase1D:NatMx in them), there should be only a single entry for the allele in alleles.json.

[
    {
        name: 'cls1-36', // The name of the allele as written in the genotype
        pattern: 'ALLELE', 
        allele_features: [
            {
                name: 'cls1-36',
                feature_type: 'ALLELE'
            }
        ]
    },
    {
        name: 'ase1D:NatMx',
        pattern: 'GENEd-MARKER',
        allele_features: [
            {
                name: 'ase1',
                feature_type: 'GENE'
            },
            {
                name: 'NatMx',
                feature_type: 'MARKER'
            }
        ]
    }
]

How you should go about this:

  • To make sure that your function works for the example I provided, create an example tsv file with this content and test your code there:
    Sample Name	Genotype
    ID123	h- cls1-36 ase1D:NatMx
  • First, create the function that produces strains.json, since this does not require finding any pattern.
  • Then, create the first version of the function that produces alleles.json, but only with the name field.
    • Since there should be only one entry per allele, to avoid repetition, you can store your alleles in a set.
    • Then you can iterate over the set and create a dictionary for each allele with {name: allele_name}
  • Continue on the function that produces alleles.json:
    • Iterate along the set to create the complete dictionary (call replace_allele_features on each allele, and generate a dictionary like the one in the example)
    • Don't forget to also extract the mating type.
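The strains.json step above could look something like this sketch (the function name and file handling details are mine, not the repo's):

```python
import json

MATING_TYPES = ('h-', 'h+', 'h90')

def build_strains(tsv_file):
    """One object per row of strains.tsv. Assumes a tab-separated
    file with a header line; splits the genotype on whitespace and
    pulls out the mating type (h- / h+ / h90)."""
    strains = []
    with open(tsv_file) as f:
        next(f)  # skip the header line
        for line in f:
            strain_id, genotype = line.rstrip('\n').split('\t')
            parts = genotype.split()
            mating_type = next((p for p in parts if p in MATING_TYPES), None)
            strains.append({
                'id': strain_id,
                'genotype': genotype,
                'mating_type': mating_type,
                'alleles': [p for p in parts if p not in MATING_TYPES],
            })
    return strains

# Recreate the example tsv from the issue and run the function on it
with open('strains_example.tsv', 'w') as f:
    f.write('Sample Name\tGenotype\nID123\th- cls1-36 ase1D:NatMx\n')

with open('strains.json', 'w') as f:
    json.dump(build_strains('strains_example.tsv'), f, indent=4)
```

The alleles.json step would then iterate over a set built from all the alleles lists, as described above.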

First version of the pipeline

This first version should have the following:

  • Formatting for the Dey lab strains:
    • A python script in Lab_strains/dey_lab/format.py that takes Manu_strains.xlsx as an input and returns a file Lab_strains/dey_lab/strains.tsv in which the first column is the id of the strain (e.g. GD172) and the second column is the genotype.
  • Replacing allele names in the Dey lab collection:
    • A python script that:
      • Takes Lab_strains/dey_lab/strains.tsv and data/alleles.toml as an input and replaces the allele names by the word ALLELE and prints all the substituted genotypes to a text file, one per line.
      • For now, no need to format this output, we can do it in next iterations.
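The formatting script could be as short as this sketch. The excel column names are guesses to be checked against Manu_strains.xlsx, and the lowercase output column names follow the later third-version issue; I split out a frame-level helper so the core is easy to test:

```python
import pandas as pd

def frame_to_tsv(df, tsv_file, id_column='Sample Name',
                 genotype_column='Genotype'):
    """Keep only the id and genotype columns and write them as a tsv.
    The default column names are assumptions about the excel file."""
    out = df[[id_column, genotype_column]].copy()
    out.columns = ['strain_id', 'genotype']
    out.to_csv(tsv_file, sep='\t', index=False)

def excel_to_tsv(excel_file, tsv_file):
    # Reading xlsx needs the openpyxl extra for pandas (see the
    # "Add openpyxl to poetry dependencies" issue below)
    frame_to_tsv(pd.read_excel(excel_file), tsv_file)
```

Usage from Lab_strains/dey_lab would then be excel_to_tsv('Manu_strains.xlsx', 'strains.tsv').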

You can push your code after the first task, so that I can review it already, and then the second one.

Add openpyxl to poetry dependencies

When adding new dependencies to the project, you have to add them to poetry, so that when I download the new version of the code, I can run it. The optional dependency openpyxl for pandas was missing, so I could not run the notebooks. To add new dependencies (in the root folder):

poetry add dependency_name

Run tests

Hello, I have added some tests that should allow you to check whether you have addressed the two points that I mentioned in the email. You will have to pull my changes to run them. To run them in the command line:

# activate poetry environment
poetry shell
# in the directory Lab_strains/dey_lab
python -m unittest

Please read carefully the guidelines in the #22 issue, and you should be able to get these two tests working, then you can move on to the rest.

Small API

Hi @anamika-yadav99, I am thinking of picking the project back up in a few weeks, and I think it would be nice to have a simple web interface to quickly show in the browser what the pipeline does.

The first step towards that is making an API entrypoint where the user would make a request with the allele name, and get back the json of the pattern. For example:

User input:

{
  "name": "cut11-mch:ura4+"
}

Response:

{
  "name": "cut11-mch:ura4+",
  "pattern": [
      ["GENE", ["cut11"]],
      ["-", ["-"]],
      ["TAG", ["mch"]],
      ["-", [":"]],
      ["ALLELE", ["ura4+"]]
    ]
}

This is relatively easy to do using FastAPI. I have made a file to run the API, so you only have to write a function that takes the allele name and returns the pattern, and that should work. Your function would have to be called here:

@app.post("/process_allele", response_model=ProcessAlleleResponse)
async def check_allele(request: ProcessAlleleRequest):
    # This variable contains the allele name, in the example cut11-mch:ura4+
    allele_name = request.name
    # Here is where your function would take the allele name and return the pattern
    # allele_pattern = your_function(allele_name)
    # For now there is this dummy allele_pattern, you can comment it out when
    # you have added your function:
    allele_pattern = [["GENE", ["cut11"]], ["-", ["-"]],
                      ["TAG", ["mch"]], ["-", [":"]], ["ALLELE", ["ura4+"]]]
    # Here you instantiate the response object
    response = ProcessAlleleResponse(
        name=allele_name,
        pattern=allele_pattern
    )
    return response

Your function goes where it says your_function.

You can run the api in your computer very easily and try the dummy request following these instructions:

Install dependencies

The easiest would be to clone the repo again, I think:

git clone https://github.com/manulera/genestorian_data_refinement
cd genestorian_data_refinement
git checkout -b api
git pull origin api

Then install dependencies and activate environment

poetry install
poetry shell

Remember also to download the necessary files (see https://github.com/manulera/genestorian_data_refinement#strain-lists)

Running and using the API

To run the API locally:

uvicorn api:app --reload

Then you go to http://127.0.0.1:8000/ and you should see the API documentation, where you can make a test request. Go to /process_allele and click on "try it out". This will allow you to add an example json, by default the one from above. Then you click on the "execute" button and you should see the Response body, which is what the function check_allele returns.

What to do

Replace the function your_function by a function that takes the allele name and returns the pattern, and see if it works when you make a request from the api.

Convert gene_IDs_names to toml

Create a script get_data/convert_genes2toml.py that writes the file data/genes.toml from data/gene_IDs_names.tsv, the same way that get_data/convert_alleles2toml.py writes data/alleles.toml from data/alleles_pombemine.tsv.

What the fields are is explained in the readme.

For a row in the file, such as:

SPAC1250.01	snf21	SPAC29A4.21,brg1

The output should be

# Note that here the gene name is double-quoted, because it contains a dot; otherwise toml
# would think we are specifying a sub-table. You don't have to handle this in python: you can use
# 'SPAC1250.01' as a dictionary key, and the toml writer will know to format it like this in the output.
[gene."SPAC1250.01"] 
ref = "SPAC1250.01"
# "name" will be empty sometimes. Not all genes have a preferred name.
# This should be taken into account when replacing gene names
name = "snf21"
# This field will be missing sometimes (not all genes have synonyms).
synonyms = ["SPAC29A4.21", "brg1"]
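A sketch of what convert_genes2toml.py could do, writing the toml by hand. The function name is mine, and the column layout (systematic ID, preferred name, comma-separated synonyms) is inferred from the example row:

```python
import csv

def genes_tsv_to_toml(tsv_file, toml_file):
    """Write a genes.toml from a gene_IDs_names.tsv-style file."""
    with open(tsv_file) as fin, open(toml_file, 'w') as fout:
        for row in csv.reader(fin, delimiter='\t'):
            ref = row[0]
            # Quote the key so the dot does not create a sub-table
            fout.write(f'[gene."{ref}"]\n')
            fout.write(f'ref = "{ref}"\n')
            # Not all genes have a preferred name
            if len(row) > 1 and row[1]:
                fout.write(f'name = "{row[1]}"\n')
            # Not all genes have synonyms
            if len(row) > 2 and row[2]:
                syns = ', '.join(f'"{s}"' for s in row[2].split(','))
                fout.write(f'synonyms = [{syns}]\n')
            fout.write('\n')

# Try it on the example row from above
with open('genes_example.tsv', 'w') as f:
    f.write('SPAC1250.01\tsnf21\tSPAC29A4.21,brg1\n')
genes_tsv_to_toml('genes_example.tsv', 'genes_example.toml')
```

Running this on the example row reproduces the toml entry shown above.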

Some thoughts for a future version using nltk

Hi @anamika-yadav99

I have been giving some thought to how to do the semantic patterns, as shown in allele_components/other.toml.

Ultimately, the goal is to understand what the elements in each allele are. For an allele of the form GENE-TAG-MARKER, the allele is simply the sequence of the GENE, the sequence of the TAG and the sequence of the MARKER one after the other, so there is not much to do in that case.

In other cases, however, such as gene deletions, the situation can be different. Take the simple example of a promoter replacement: MARKER::pGENE-GENE means that the promoter of the second gene has been replaced by the promoter of the first gene. For example, in KanMX::pase1-klp9 the promoter of klp9 has been replaced by the promoter of ase1. The pipeline should understand, from finding pGENE, that this refers to the promoter of a gene and not the gene itself.

As you said, it is not obvious what the best way to go about this is, or how to deal with the more nested patterns that we will surely find. I was digging a bit into what Anika sent us, and I think some of the objects from nltk might be useful for what we are trying to do; see the small example below:

from nltk.tree import Tree

# An example for ase1::NatMx klp9-mCherry::KanMx


allele_1 = Tree('ALLELE', [
                Tree('GENE_DELETION', [
                    Tree('GENE', ['ase1']),
                    Tree('SPACER', ['::']),
                    Tree('MARKER', ['NatMx'])
                ]
                )])

allele_2 = Tree('ALLELE', [
                Tree('GENE', ['klp9']),
                Tree('SPACER', ['-']),
                Tree('TAG', ['mCherry']),
                Tree('SPACER', ['::']),
                Tree('MARKER', ['KanMx']),
                ])

genotype = Tree('GENOTYPE',
                [allele_1, Tree('SPACER', [' ']), allele_2]
                )

genotype.draw()

This produces a tree diagram of the genotype: a GENOTYPE root with the two ALLELE subtrees below it.

There must be some object in the nltk library where we can define rules to identify those semantic patterns once we have done the first round of substitutions. We can discuss this in a call.
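For instance, nltk.RegexpParser can define rules over tagged sequences and build trees like the ones above. A minimal sketch (the grammar and tag names are just an illustration, not a worked-out design):

```python
import nltk

# One chunking rule: a GENE followed by a SPACER and a MARKER
# is grouped into a GENE_DELETION subtree
parser = nltk.RegexpParser('GENE_DELETION: {<GENE><SPACER><MARKER>}')

tagged = [('ase1', 'GENE'), ('::', 'SPACER'), ('NatMx', 'MARKER')]
tree = parser.parse(tagged)
print(tree)
```

The result is a Tree object like the hand-built ones above, so both representations could share the same downstream code.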

Understanding tests

I will add some more test cases now. As a part of understanding a bit better how to write tests before you start writing your own, let's first do this exercise:

Write a markdown document where, for each test function, you describe what each assertion is testing. For example, for the first test function test_that_test_files_are_there, you should do something like this:

  • test_that_test_files_are_there:
    • self.assertTrue(os.path.isfile('./strains_test.tsv')) tests whether the file strains_test.tsv exists.

Do this for each function, for each of the assertions. This will help you understand the logic of the testing before writing your own tests for the next version. If you have doubts on why one of the assertions is tested, let me know.

Generalise allele feature substitution from toml files

Do this after #17 so you have the genes in toml file as well.

  • In the script you have written to extract the data from the Dey lab, add a function that:
    • Takes as an input:
      • A list of genotypes.
      • A toml file with allele features (genes, alleles, tags, etc.).
      • A string that will be used to replace the allele features.
    • Replaces the allele features defined in the toml file by the string you pass as third argument (see example below)
  • Remember to replace the longest names or synonyms first, so that kanMx6 gets replaced before kan etc.
  • Remember to make everything lowercase when replacing strings.

The function may look like this:

def replace_allele_features(toml_file, genotypes, word):
    ...  # implementation goes here

genotypes = ['cls1-36 ase1-GFP:KanMx6']
# Replace all the allele features included in alleles.toml by the string ALLELE
genotypes2 = replace_allele_features('data/alleles.toml', genotypes, 'ALLELE')
#genotypes2: ['ALLELE ase1-GFP:KanMx6']

# Replace all the genes included in genes.toml by the string GENE
genotypes3 = replace_allele_features('data/genes.toml', genotypes2, 'GENE')
#genotypes3: ['ALLELE GENE-GFP:KanMx6']

Then try the code for alleles and genes as in the example above
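The core of the substitution could be sketched like this. I separate the name collection (reading the toml) from the replacement itself so the latter is easy to test; the helper name is mine:

```python
import re

def replace_feature_names(genotypes, names, word):
    """Replace every occurrence of any name in `names` by `word`.
    Names are tried longest first, so kanMx6 wins over kan, and
    everything is lowercased before replacing."""
    ordered = sorted(names, key=len, reverse=True)
    # Alternatives are tried left to right, so longest first works
    pattern = re.compile('|'.join(re.escape(n.lower()) for n in ordered))
    return [pattern.sub(word, g.lower()) for g in genotypes]

genotypes = ['cls1-36 ase1-GFP:KanMx6']
# 'kan' is listed too, but 'kanmx6' is replaced because it is tried first
genotypes2 = replace_feature_names(genotypes, ['kan', 'KanMx6'], 'MARKER')
# genotypes2: ['cls1-36 ase1-gfp:MARKER']
```

replace_allele_features would then just read all name and synonyms entries from the toml file and pass them to this helper.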

First version of the output format

A first version of the output format may look like this. We could have a json array, where each strain could look like this:

Example for a strain with id Y129 and genotype h- ase1-mCh::KanMx NatMx::nmt1-GFP-mal3 cls1-36 klp9D::hph

{
    "id": "Y129",
    "genotype": "h- ase1-mCh::KanMx NatMx::nmt1-GFP-mal3 cls1-36 klp9D::hph",
    "mating_type": "h-",
    "alleles": [
        {
            "name": "ase1-mCh::KanMx",
            "locus": "SPAPB1A10.09",
            // This should appear in the order they appear within the allele
            "allele_components": [
                {
                    "given_name": "ase1",
                    "real_name": "ase1",
                    "type": "wild-type gene sequence",
                    "reference": "SPAPB1A10.09"
                },
                {
                    "given_name": "mCh",
                    "real_name": "mCherry",
                    "type": "tag",
                    "reference": "https://www.fpbase.org/protein/mcherry/"
                },
                {
                    "given_name": "KanMx",
                    "real_name": "KanMx",
                    "type": "marker",
                    // Maybe we can find a reference for this as well, not sure, does not matter much I think
                    "reference": "???"
                },
            ]
        },
        {
            "name": "NatMx::nmt1-GFP-mal3",
            "locus": "SPAC18G6.15",
            // This should appear in the order they appear within the allele
            "allele_components": [
                {
                    "given_name": "NatMx",
                    "real_name": "NatMx",
                    "type": "marker",
                    "reference": "???"
                },
                {
                    "given_name": "nmt1",
                    "real_name": "pnmt1",
                    "type": "promoter",
                    "reference": "???"
                },
                {
                    "given_name": "GFP",
                    "real_name": "GFP",
                    "type": "tag",
                    // This reference might be wrong, we will figure out, but not very important at this point.
                    "reference": "https://www.fpbase.org/protein/megfp/"
                },
                {
                    "given_name": "mal3",
                    "real_name": "mal3",
                    "type": "wild-type gene sequence",
                    "reference": "SPAC18G6.15"
                },
            ]
        },
        {
            "name": "cls1-36",
            "locus": "SPAC3G9.12",
            // This should appear in the order they appear within the allele
            "allele_components": [
            {
                "given_name": "cls1-36",
                "real_name": "cls1-36",
                "type": "modified gene sequence",
                // We still don't have unique identifiers for alleles in pombase, but since there can only be one allele with a given name for each gene, for now we can use the gene reference.
                "reference": "SPAC3G9.12"
            }
            ]
        },
        {
            "name": "klp9D::hph",
            "locus": "SPBC15D4.01c",
            // Note that this one has no gene sequence, so we only put the marker
            "allele_components": [
            {
                "given_name": "hph",
                "real_name": "HphMx",
                "type": "marker",
                "reference": "???"
            }
            ]
        }
    ]

}


Fix dependencies

Hello,

I had a look at the code you submitted, some things need fixing:

  • Fix poetry environment. You introduced an error by changing the folder name, but not changing the dependencies. I understand you are not using any functions from that module now, but you will add some in the future, so it's important to fix it at this stage.
    • poetry.lock expects that there is a module genestorian_data_refinement in the folder, but you have renamed it, so it's no longer there. If you run poetry install, the command will fail. What you need to do to make sure you have fixed it:
      • Delete your .venv
      • Delete your poetry.lock
      • Re-add your module with develop=true as in this example.
      • Fix the name of module in setup.py
        setup(name='genestorian_data_refinement', packages=find_packages())
      • Remove the directory genestorian_module/genestorian_data_refinement.egg-info
      • Remove the directory genestorian_data_refinement if you still have it.
      • Run poetry install. If it works, then it means you have fixed the dependencies

I will write some tests so that you get a warning if you have made mistakes with the environment in future commits. This is really important because I can't even start running your code if the environment does not work.

Fourth version (using nltk)

In this next version, let's use the tagging syntax used by nltk.

For example, for pfus1-ase1-mCherry, we now would generate: pGENE-GENE-TAG, in their format it should be:

['p',('fus1', 'GENE'),('-','-'),('ase1', 'GENE'),('-','-'),('mCherry','TAG')]

Initially I thought we could store this as plain text, but it's not worth the headache of escaping characters (the default separator is \, but I am sure we will encounter every thinkable separator). I would stick with json, something like this for each allele:

[
{
  "name": "pfus1-ase1-mCherry",
  "pattern": ["p",["fus1", "GENE"],["-","-"],["ase1", "GENE"],["-","-"],["mCherry","TAG"]]
}
]

So for this version, write some code that exports the alleles to this format, and a test that verifies that it works on a few examples. To restore our previous patterns, which are also useful to generate the occurrences file, you can do something like this:

"".join([i if type(i)==str else i[1] for i in pattern])

For the separators, make a list of known separators in a text file; for now I imagine it's mostly : and -, but we will have to add more in the future. Any run of consecutive separators in an allele name should be counted as one, so use regex. They will have to be replaced after all other features, as some allele names contain separators in them. We could name the feature "SEPARATOR", but I think "-" is more readable, and this is supported by nltk.
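The separator step could be sketched like this (the separator list and function name are mine; the point is that runs of separators collapse into one tagged token):

```python
import re

SEPARATORS = ':-'  # known separators; keep the real list in a text file
# Any run of consecutive separators counts as one
sep_regex = re.compile(f'[{re.escape(SEPARATORS)}]+')

def tag_separators(allele_name):
    """Split an allele name around separator runs, tagging each run
    with '-' in the (token, tag) pair format. Assumes the other
    features are tagged in a separate step."""
    tagged = []
    pos = 0
    for m in sep_regex.finditer(allele_name):
        if m.start() > pos:
            tagged.append(allele_name[pos:m.start()])
        tagged.append((m.group(), '-'))
        pos = m.end()
    if pos < len(allele_name):
        tagged.append(allele_name[pos:])
    return tagged

print(tag_separators('pfus1-ase1::mCherry'))
```

The untagged strings left between the separators are the pieces the feature substitution still has to tag.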
