
genome's Introduction

New Outline: Better NLP background research and precise/concise descriptions

Tokenization

Case folding

  • Lowercase every word except words that are capitalized in the middle of a sentence.
    • General Motors
    • FED vs fed
    • SAIL vs sail
      • Group consecutive capitalized words in the same sentence into a single token, e.g. General Motors.
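
The case-folding rule above could be sketched as follows. This is a minimal illustration, not a full truecasing method: it assumes the input is a single sentence that is already split on whitespace with punctuation removed, and the function name `fold_case` is hypothetical.

```python
from __future__ import annotations


def fold_case(sentence: str) -> list[str]:
    """Lowercase tokens, but keep mid-sentence capitalized words and merge runs of them."""
    words = sentence.split()
    tokens: list[str] = []
    run: list[str] = []  # current run of consecutive capitalized words

    def flush() -> None:
        # Emit the pending run of capitalized words as one token.
        if run:
            tokens.append(" ".join(run))
            run.clear()

    for i, word in enumerate(words):
        # Assumption: a capital on the first word only marks the sentence start,
        # so the first word is always lowercased.
        if i > 0 and word[0].isupper():
            run.append(word)
        else:
            flush()
            tokens.append(word.lower())
    flush()
    return tokens


print(fold_case("The board of General Motors met the Fed"))
```

Under these assumptions "General Motors" survives as one token and "Fed" stays distinct from "fed", while the sentence-initial "The" is lowercased.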

Word Normalization

  • U.S.A => USA. We need to specify the equivalence classes of terms. In this example the equivalence class is implicit: it is defined by deleting the periods in a term.
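
A minimal sketch of this implicit equivalence class, assuming that deleting periods is the only normalization rule. The helper names `normalize_term` and `equivalence_classes` are hypothetical.

```python
from collections import defaultdict


def normalize_term(term):
    """Map a term to its class representative by deleting periods."""
    return term.replace(".", "")


def equivalence_classes(terms):
    """Group terms that share the same normalized form."""
    classes = defaultdict(set)
    for term in terms:
        classes[normalize_term(term)].add(term)
    return dict(classes)


print(equivalence_classes(["U.S.A", "USA", "U.S.A.", "sail"]))
```

Every term that normalizes to the same string lands in the same class, so "U.S.A", "U.S.A." and "USA" are treated as one word.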

OLD OUTLINE

To do

Bag-of-Words

Get the unique words occurring across all texts in our corpus and build a vocabulary from them.

The vocabulary will be a vector whose length is the number of unique words in the corpus. Each document will be described by a vector whose entries are the number of occurrences of each vocabulary word in that document.
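
The construction above can be sketched as below. This is an illustration only: it assumes documents are plain strings tokenized on whitespace, and the function names and the two-document corpus are made up for the example.

```python
from collections import Counter


def build_vocabulary(documents):
    """Sorted list of the unique whitespace-separated words in the corpus."""
    return sorted({word for doc in documents for word in doc.split()})


def vectorize(document, vocabulary):
    """Count how often each vocabulary word occurs in one document."""
    counts = Counter(document.split())
    return [counts[word] for word in vocabulary]


corpus = ["the cat sat", "the cat saw the dog"]
vocabulary = build_vocabulary(corpus)
print(vocabulary)                        # ['cat', 'dog', 'sat', 'saw', 'the']
print(vectorize(corpus[1], vocabulary))  # [1, 1, 0, 1, 2]
```

Sorting the vocabulary fixes the position of each word, so every document vector uses the same coordinate order.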

Problems

- [ ] The vector can be large and sparse, so use a scipy.sparse structure to represent it (which one?)
- [ ] Text cleaning techniques to reduce the vocabulary
    - Ignoring case
    - Ignoring punctuation
    - Ignoring frequent words that don’t contain much information, called stop words, like “a,” “of,” etc.
    - Stemming: reducing words to their stem (e.g. “play” from “playing”) using a stemming algorithm
    - Fixing misspelled words
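
A rough sketch of the cleaning steps listed above, using only the standard library. The stop-word set and the suffix-stripping "stemmer" are toy assumptions, not a real stemming algorithm such as Porter's. Only ASCII punctuation is removed, and spelling correction is left out.

```python
import string

# Illustrative stop-word list (assumption); real lists are much longer.
STOP_WORDS = {"a", "an", "the", "of", "and", "to", "in"}
# Toy suffix list (assumption); a real stemmer handles far more cases.
SUFFIXES = ("ing", "ed", "s")


def naive_stem(word):
    """Strip the first matching suffix if at least three letters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def clean(text):
    """Lowercase, drop ASCII punctuation, remove stop words, then stem."""
    table = str.maketrans("", "", string.punctuation)
    words = text.lower().translate(table).split()
    return [naive_stem(word) for word in words if word not in STOP_WORDS]


print(clean("Playing a game of cards, the kids played."))
```

On the sparsity question, counting the cleaned tokens with `collections.Counter` stores only the non-zero entries of a document vector, which may be enough before committing to a particular sparse matrix format.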

Enhancement

Generalize to bag-of-bigrams model.

*An N-gram is an N-token sequence of words: a 2-gram (more commonly called a bigram) is a two-word sequence of words like “please turn”, “turn your”, or “your homework”, and a 3-gram (more commonly called a trigram) is a three-word sequence of words like “please turn your” or “turn your homework”, and so on up to an n-gram.*
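
The definition above maps to a short sliding-window function. This sketch assumes the text is already tokenized into a list of words, and the name `ngrams` is chosen for illustration.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = "please turn your homework".split()
print(ngrams(tokens, 2))  # bigrams
print(ngrams(tokens, 3))  # trigrams

# A bag-of-bigrams is the same counting idea applied to bigrams instead of words.
bag_of_bigrams = Counter(ngrams(tokens, 2))
print(bag_of_bigrams)
```

With `n = 1` the function returns one-word tuples, so the bag-of-words model is the special case of a bag of 1-grams.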

Notes

- Take a document as the input.
- Read the document line by line
- Tokenize the line (put into a vector)
- Process the words.
- https://en.wikipedia.org/wiki/Tf%E2%80%93idf
- https://s3.amazonaws.com/assets.datacamp.com/production/course_5064/slides/chapter2.pdf
- https://medium.com/@aakashtandel/the-basics-of-natural-language-programming-a-big-bag-of-words-2f2ac06638ea
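
The four steps in the notes above could fit together roughly as shown here. The `process_word` step is a hypothetical placeholder (lowercasing and trimming surrounding punctuation), and `io.StringIO` stands in for an open text file.

```python
import io
from collections import Counter


def process_word(word):
    """Placeholder processing step (assumption): lowercase and trim punctuation."""
    return word.lower().strip(".,;:!?\"'()")


def count_words(lines):
    """Read lines one at a time, tokenize each, and count the processed words."""
    counts = Counter()
    for line in lines:         # read the document line by line
        tokens = line.split()  # tokenize the line into a list
        for token in tokens:
            word = process_word(token)
            if word:           # skip tokens that were only punctuation
                counts[word] += 1
    return counts


document = io.StringIO("Please turn your homework in.\nTurn it in today!\n")
print(count_words(document))
```

Because `count_words` accepts any iterable of lines, an open file object can be passed in directly without loading the whole document into memory.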

Patent Background

Breakthroughs before Gene Therapy

  • Stem cell therapy (1970s)
  • Immunotherapy (1970s)

9 Types of Molecular Scissors (probably more)

  • Cas9 (RNA)
  • TALE (Protein)
  • Group II intron (RNA)
  • Meganuclease (Protein)
  • Recombinase (Protein)
  • TtAgo (DNA)
  • λ-beta/exo MAGE (DNA)
  • RecACage (DNA)
  • ZnF (Protein)

4 Main Types

  • Meganucleases
  • Zinc finger nucleases (ZFNs)
  • Transcription activator-like effector-based nucleases (TALENs)
  • Clustered regularly interspaced short palindromic repeats (CRISPR/Cas9)

Websites
