In this next version, let's use the tagging syntax used by n

Fourth version (using nltk) about genestorian_data_refinement CLOSED

manulera commented on August 23, 2024

Fourth version (using nltk)

from genestorian_data_refinement.

Comments (6)

manulera commented on August 23, 2024

Some more things to add to this version:

Create a folder called analysis in the root directory of the repo. In it:
- Add the script that runs this new pipeline, fourth_version_pipeline.py, so that it can be run like this:
```
python fourth_version_pipeline.py ../Lab_strains/dey_lab/strains.tsv
```
  It should still save the output files in the directory where strains.tsv is.
  
  To read the command-line arguments, use the `if __name__ == "__main__":` idiom in the script. To write the tests, simply import the main function (or other functions) from the script.
In the new file, put all function definitions at the top, and then the code that calls them in the main function. This helps readability. The main function should start with `strain_list = build_strain_list('strains.tsv')` (more or less).
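
The layout described above could look roughly like the sketch below. It is only an illustration: the body of `build_strain_list` and the output file name `example_output.txt` are hypothetical placeholders, not the real pipeline's behaviour.

```python
import csv
import os
import sys


def build_strain_list(strains_file):
    """Hypothetical stand-in: read each row of the TSV into a list of fields."""
    with open(strains_file, newline="") as handle:
        return [row for row in csv.reader(handle, delimiter="\t")]


def main(input_file):
    strain_list = build_strain_list(input_file)
    # Save outputs next to the input file, wherever the script is run from.
    output_dir = os.path.dirname(os.path.abspath(input_file))
    # Hypothetical output name, used here only to show where files end up.
    output_file = os.path.join(output_dir, "example_output.txt")
    with open(output_file, "w") as out:
        out.write(f"{len(strain_list)} rows read\n")
    return output_file


if __name__ == "__main__":
    # Only runs when the file is executed as a script, not when it is imported.
    if len(sys.argv) == 2:
        main(sys.argv[1])
    else:
        print("Usage: python fourth_version_pipeline.py <path/to/strains.tsv>")
```

Because the call to `main` sits behind the `__name__` check, a test module can do `from fourth_version_pipeline import main` without triggering a pipeline run.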

manulera commented on August 23, 2024

- Fix the typo in "occurrences" (it's two c's and two r's) in function names and filenames.
- Write format.py for all labs. Some will require further editing; you can ask me about them once you have checked their strain lists. If you need to do some pre-processing, do it as in the example in tran_lab/format.py and store the result in an intermediate Excel sheet called post_processed.xlsx, as in the example. That way you can call excel_to_tsv on that file.
- Move data/strains.tsv to Lab_strains/nbrp_strains, rename it to strains_raw.tsv, and write a format.py for that one as well. Commit the strains_raw.tsv file, since the data in it is public and we will use it for the documentation below.
- Delete the trans_lab folder (the correct name is tran_lab).
- Run your pipeline in all the Lab_strains folders and check that it does not fail for any of them.
- Have a look at the common occurrences across labs and see if you can identify more patterns. When you do, add them to the appropriate allele_components toml file. If some don't fit anywhere, add them to the previous Google Doc and we can discuss.

manulera commented on August 23, 2024

Review your code, and add some comments in the parts that you think will be harder to follow for someone looking at them for the first time. If you struggle to understand what a part of the code does, see if you can improve the variable names or add some concise comments.
Add a section to the readme called "Running the pipeline" to explain what the scripts do and the outputs they produce. Taking Lab_strains/nbrp_strains as an example, it should cover:
- What format.py does and what you would have to do to write your own format.py for your spreadsheet.
- What the analysis script does, and for each output file, what is in the file.

manulera commented on August 23, 2024

Hello Anamika, I noticed that to read the trees into nltk, we would need to change the output. For example, if we have ase1::NatR:

```
from nltk.tree import Tree

# The output currently looks like this:
[['ase1', 'GENE'], ['NatR', 'marker']]
# But it should look like this (note how the value of the strings is in a list):
[['GENE', ['ase1']], ['marker', ['NatR']]]
# So that we can make a tree from it:
Tree(*['GENE', ['ase1']])
# or
Tree('GENE', ['ase1'])
```

You will need to fix this in the pipeline and tests, sorry that we did not pick it up before. Note that the reason for this is to be able to capture nested things like p-Gene:

```
from nltk.tree import Tree
Tree(*["GENE", ["ase1"]])
Tree(*['PROMOTER', ['p', Tree(*['GENE', ['ase1']])]])
```
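
The format change described above amounts to swapping each pair and wrapping the value in a list. A minimal sketch of that conversion, where the helper name `to_tree_format` is hypothetical and not part of the existing pipeline:

```python
def to_tree_format(tagged_tokens):
    """Convert [value, label] pairs into [label, [value]] pairs."""
    return [[label, [value]] for value, label in tagged_tokens]


# Current pipeline output for ase1::NatR, as quoted in the comment above.
old_output = [['ase1', 'GENE'], ['NatR', 'marker']]
new_output = to_tree_format(old_output)
print(new_output)
```

Each resulting pair can then be unpacked into nltk's `Tree`, as in `Tree(*pair)`. This one-off conversion only covers flat pairs; nested cases such as p-Gene would have to be produced by the pipeline itself.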

anamika-yadav99 commented on August 23, 2024
