Comments (6)

manulera commented on August 23, 2024

Some more things to add to this version:

  • Create a folder in the root directory of the repo called analysis, in it:

    • Add the script that runs this new pipeline: fourth_version_pipeline.py such that it can be run like this:

      python fourth_version_pipeline.py ../Lab_strains/dey_lab/strains.tsv
      

      It should still save the output files in the directory where strains.tsv is.

      To handle the command-line argument, use the if __name__ == "__main__": idiom in the script. To write the tests, simply import the main function (or any other function) from the script.

  • In the new file, put all function definitions at the top, and then the code that calls them in the main function. This helps readability. The main function should start with strain_list = build_strain_list('strains.tsv') (more or less).
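
A minimal sketch of the layout described above, not the actual pipeline: function definitions first, then main, then the if __name__ == "__main__": block. The body of build_strain_list, the output filename example_output.txt, and the choice to print the usage line when no path is given are all assumptions made for illustration.

```python
import argparse
import os


def build_strain_list(strains_file):
    """Hypothetical stand-in: read a TSV file and return its rows without the header."""
    with open(strains_file) as handle:
        rows = [line.rstrip("\n").split("\t") for line in handle if line.strip()]
    return rows[1:]


def main(strains_file):
    strain_list = build_strain_list(strains_file)
    # Outputs are written next to strains.tsv, not in the current working directory.
    output_dir = os.path.dirname(os.path.abspath(strains_file))
    output_file = os.path.join(output_dir, "example_output.txt")  # hypothetical name
    with open(output_file, "w") as handle:
        handle.write(f"{len(strain_list)} strains read\n")
    return output_file


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Run the pipeline on a strains.tsv file")
    parser.add_argument("strains_file", nargs="?", help="path to strains.tsv")
    args = parser.parse_args()
    if args.strains_file is None:
        parser.print_usage()
    else:
        main(args.strains_file)
```

Because the call to main sits behind the if __name__ == "__main__": guard, a test module can import main and build_strain_list without triggering a run.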

from genestorian_data_refinement.

manulera commented on August 23, 2024
  • Fix the misspelling of "occurrences" (two c's and two r's) in function names and filenames.
  • Write format.py for all labs. Some will require further editing; you can ask me about them once you have checked their strain lists. If you need to do some pre-processing, do it as in the example in tran_lab/format.py and store the result in an intermediate Excel sheet called post_processed.xlsx, as in that example. That way you can call excel_to_tsv on it.
  • Move data/strains.tsv to Lab_strains/nbrp_strains, rename it to strains_raw.tsv and write a format.py for that one as well. Commit the strains_raw.tsv file, since the data in it is public and we will use it for the documentation below.
  • Delete the trans_lab folder (the correct name is tran_lab).
  • Run your pipeline in all the Lab_strains folders and check that it does not fail for any of them.
  • Have a look at the common occurrences across labs and see if you can identify more patterns. When you do, add them to the appropriate allele_components toml file. If some don't fit anywhere, add them to the previous google doc and we can discuss.
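
One way to check that the pipeline does not fail in any lab folder is a small driver like the sketch below. It assumes that every lab folder contains a strains.tsv and that the pipeline exposes a function taking that path; run_on_all_labs and pipeline_main are hypothetical names, not part of the repository.

```python
import glob
import os
import traceback


def run_on_all_labs(lab_strains_dir, pipeline_main):
    """Call pipeline_main on every <lab>/strains.tsv and collect the failures."""
    failures = {}
    pattern = os.path.join(lab_strains_dir, "*", "strains.tsv")
    for strains_file in sorted(glob.glob(pattern)):
        try:
            pipeline_main(strains_file)
        except Exception:
            # Keep the traceback so every failing lab can be inspected afterwards.
            failures[strains_file] = traceback.format_exc()
    return failures
```

An empty dictionary means every folder ran cleanly; otherwise the keys name the strains.tsv files that need attention.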

manulera commented on August 23, 2024
  • Review your code, and add some comments in the parts that you think will be harder to follow for someone looking at them for the first time. If you struggle to understand what a part of the code does, see if you can improve variable names or add some concise comments.
  • Add a section to the readme called "Running the pipeline" to explain what the scripts do and the outputs they produce. Taking Lab_strains/nbrp_strains as an example, it should cover:
    • What format.py does and what you would have to do to write your own format.py for your spreadsheet.
    • What the analysis script does, and for each output file, what is in the file.

manulera commented on August 23, 2024

Hello Anamika, I noticed that to read the trees into nltk, we would need to change the output. For example, if we have ase1::NatR:

# The output currently looks like this:
[['ase1', 'GENE'], ['NatR','marker']]
# But it should look like this (note how the value of the strings is in a list):
[['GENE', ['ase1']], ['marker', ['NatR']]]
# So that we can make a tree from it:
Tree(*['GENE', ['ase1']])
# or
Tree('GENE', ['ase1'])

You will need to fix this in the pipeline and tests, sorry that we did not pick it up before. Note that the reason for this is to be able to capture nested things like p-Gene:

from nltk.tree import Tree
Tree(*["GENE", ["ase1"]])
Tree(*['PROMOTER', ['p', Tree(*['GENE', ['ase1']])]])
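
As a sketch of the change requested above, the conversion from the current output to the tree-friendly one could look like this. The function name to_tree_format is hypothetical and the pipeline may well build the new structure directly instead of converting it afterwards.

```python
def to_tree_format(tagged_pairs):
    """Convert [[value, tag], ...] into [[tag, [value]], ...]."""
    return [[tag, [value]] for value, tag in tagged_pairs]


old_output = [['ase1', 'GENE'], ['NatR', 'marker']]
new_output = to_tree_format(old_output)
print(new_output)  # [['GENE', ['ase1']], ['marker', ['NatR']]]
```

Each element of new_output can then be unpacked into a tree with Tree(*element), as in the examples above.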

anamika-yadav99 commented on August 23, 2024

Hi @manulera, I have made the necessary changes to the code and the readme. You can review them now.

manulera commented on August 23, 2024

Hi @anamika-yadav99, wow, that was fast! Good job! Also good job with the readme, it looks much better now!
