Giter Site home page Giter Site logo

project1's Introduction

Project 1

Building a FAST[AQ] Parser + Building a DNA -> RNA Transcriber

Assignment Overview

The purpose of this assignment is to bring everybody up to speed on using python and highlight some of the elements of the language and of OOP that will be inherited by downstream course content. We also hope to use this as an opportunity to learn some of the useful tools within github for software development that will be useful to you in your research experiences down the road.

The learning goals of this project are:

1. object oriented programming
2. python language skills
3. modular programming / predetermined APIs
4. best practices for reproducible code development

Assignment Tasks

Coding Assessment

  • Write a Fasta Parser (Write this code in the parse.py file)
  • Write a Fastq Parser (Write this code in the parse.py file)
  • Write a Transcription Function (Write this code in the seq.py file)
  • Write a Reverse Transcription Function (Write this code in the seq.py file)

Software Development Assessment

  • Create a pip installable tool (only locally, do not submit to PyPI)

    • Edit the pyproject.toml file to reflect author information
    • Update the dependencies to include those that you use in your code
  • Write a unit test (in the test.py file) for

    • Fasta Parser
    • Fastq Parser
    • Transcription
    • Reverse Transcription
  • Automate Testing with a Github Actions

    See blogposts below on helping set up github actions with pytest:

    Ensure that the github actions complete the following:

    • pip installs tool
    • runs pytest

Getting Started

To get started you will need to fork this repository onto your own github. You will then work on the code base from your own repo and make changes to it in the form of commits.

Git Basics

Overview

Git is a version control tool originally built to keep track of the linux kernel. We will expect you to become familiar with it mainly up to the point of forking, adding, committing, and pushing. There are more things you can do with it, but we will leave that up to you to learn if you are interested.

Forking

Forking is a means of copying the existing code base to another "branch" which you can then work on. This leaves the original upstream branch unchanged. This is useful when you are modifying an existing code base for an extra purpose which goes beyond the scope of the original authors. It is also the de facto way to contribute to open source repositories (by forking the existing repo, making your changes, and then merging back into the upstream branch via a pull request). We will not go over pull requests here, but there are plenty of resources online about them if you are interested.

General Pipeline

The general overview of git actions is as follows.

# staging changed files 
git add changed_file.txt changed_file_2.txt

# commiting changed files
git commit -m "this is a message describing my changes"

# pushing to the repository (upstream branch)
git push

Reference Information

Test Data

The data we will be testing with will be single-line FASTA files and single-line FASTQ files. This means that the entire sequence will be on one line and you don't need to implement a multi-line FASTA/FASTQ parser. We've included some test data under data/test.f[qa] which can be used to validate your code as you are writing it. If you don't like these test data though you can make your own by changing the seed on the tool data/make_seq.py and rerunning the code.

FASTA File Format

I am sure you have seen this before, but for those who have not, the FASTA file format is a plaintext representation of sequencing data. Some FASTA representations include multiple lines / sequence, but more often than not you will find the format only with 1 sequence per line so that is the format we will use. Here is an example of 3 sequencing records.

FASTA Representation

>Header
Sequence

Example FASTA

>sequence_1
ACGGACCACCATGAA
>sequence_2
ACGGACCTGAA
>sequence_3
ACGGACCGGATTAACCATGAA

FASTQ File Format

The FASTQ file format is very similar to the FASTA records, but it includes 2 extra lines per record. The only added information is the quality score, which will look like a computer's stream of consciousness but is in fact the confidence that a base is the base called. If you are interested in that process, take a look at PHRED scoring, if you are not that is ok too

FASTQ Representation

@Header
Sequence
+
Quality

Example FASTQ

@seq0
TGTGGTCGTATAGTTATTGTCATAAATTACACAGAATCG
+
*540($=*,=.062565,2>'487')!:&&6=,6,*7>:
@seq1
CCCCGGACGACTGATCCCGATAGAGCTCACTCTTCGAGG
+
'(<#/0$5&!$+,:=%7=50--1;'(-7;0>=$(05*9,
@seq2
GATAAACTTCTATCACGAATACTGCGGGACCATGCAGTT
+
1,758$,:7654/7<0%5/12%-3>-2.>$$443-,'9,

Transcription

Regular Transcription

Despite the hamming distance of their names, DNA and RNA are completely different molecules. While practices handling these materials in the wetlab are incredible different, the largest difference between them in the drylab is the replacement of the thymine: T for a uracil: U.

input  : ACTGAACCC
output : ACUGAACCC

Reverse-Transcription

This process is very similar to the transcription process except that the reverse transcription process is mimicing the reverse strand of the nucleic acid. However, to a computer scientist, this process can be thought of as the nesting of functions: reverse(transcribe(sequence))

input  : A C T G A A C C C
         | | | | | | | | |
output : U G U C U U G G G

Github Actions

These are really useful tools for software development and are a good way to test if you are adding breaking changes to a code base. They are a must for open source projects, but even in my own moonshine-esque bioinformatic exploits they have proven invaluable.

The idea behind them is that you can set up github to do something when you do something. If that sounds vague please know that it is merely abstracted; and for good reason! You can set these up to do quite a bit of things and test quite a bit of things in response to multiple conditions being met (i.e. merging branches, publishing packages, alerting failures, etc.). You can get into the weeds with metaprogramming and DevOps nightmares with this, but knowing a little can be really useful in distributing your code to other lab members or collaborators and is something that is useful to learn.

For the purposes of this assignment we are interested in responding to git push commands. We will be writing a github action that will test the installation of our package and then run our unit tests to make sure they pass.

Github already offers some boilerplate actions that can be found under the Actions tab on the webpage of your repo. It'll recognize the *.py files in the repository and offer you some python related options. There isn't one prebuilt for what we're asking for the assignment, but either of the 4 options provided as defaults could be easily modified to run the installation and the tests - but the PyLint one is the most similar.

You will need to run the following commands in your YAML file, but the implementation details are up to you

pip install . 
pytest -v

Python Modules

pip is the go to installer for python packages. You may have heard of conda before, which is an environment manager and a package installer, and it is very useful for defining specific environments that don't break as years go by. conda is a useful tool, but is beyond the scope of this class. Frankly, pip is beyond the scope also in a 10 week course, but writing shareable code is important and learning how to write a standalone module makes that process much easier.

There are a lot of details and documentation behind making a python module, but the skinny is that at minimum you need to tell pip (the installer) what your packages is, the minimum requirements, and the module dependencies. You put this information into a file: pyproject.toml or setup.py and then you can install the tool into an existing environment (whethers its conda managed or not).

The original means of installing was by using the setup.py file and specifying information that way, but this is no longer the best practice and it has moved to using a pyproject.toml framework. However, both are supported, and it is not a requirement to use one or the other for this class.

setup.py

You can read more about this here Here is an example setup.py

from setuptools import setup

setup(
    name= #NAME_OF_PACKAGE,
    version= #VERSION_OF_PACKAGE,
    author= #WHO_ARE_YOU,
    author_email= #HOW_DO_I_YELL_AT_YOU,
    packages= [
		#WHAT_DO_I_CALL_THIS_THING
		],
    description= #WHAT_IS_THIS_THING,
	install_requires= [
		#WHAT_DOES_THIS_NEED_TO_WORK
		]
)

pyproject.toml

you can read more about creating packages easily with flit

here is an example pyproject.toml

[build-system]
requires = [
	"flit_core >=3.2,<4",
	"python_version >= '3.7'"
	]
build-backend = "flit_core.buildapi"

[project]
name = #NAME_OF_PACKAGE
authors = [{
	name = #WHO_ARE_YOU,
	email = #HOW_DO_I_YELL_AT_YOU
	}]
readme = "README.md"
license = {file = "LICENSE"}
classifiers = ["License :: OSI Approved :: MIT License"]
dynamic = ["version", "description"]
dependencies = [
	#WHAT_DOES_THIS_NEED_TO_WORK
]

Installation

If everything works as its supposed to you will now have a python module you can use!

The following code is how you would install it:

pip install NAME_OF_PACKAGE

Unit Testing

The testing framework we are going to use in grading is pytest. We recommend you also learn how to use pytest when making your unit tests, but you can use whatever testing framework you want. pytest is easy to use though and can be setup pretty quickly. You can read the documentation for pytest here

The main idea is that you want to test your code with assertions. These assertions must always be true! If they are broken, then your code is not doing what it's supposed to be doing.

Here is an example of a unit test:

def add_numbers(x, y):
    return x + y

assert add_numbers(2,3) == 5
assert add_numbers(2,3) != 0

pytest will by default recursively search for functions to test in all tests/test*.py files that meet the regex test*.

Here is an example of a testing script that will test two functions once pytest is run:

from module import add_numbers

def test_module_correct():
    assert add_numbers(2,3) == 5

def test_module_incorrect():
    assert add_numbers(2,3) != 0

project1's People

Contributors

noamteyssier avatar jzhang1198 avatar pgrosjean avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.