nlp's Introduction

Project Name: NGrams
Project Description: An implementation of n-gram language models that computes unsmoothed
unigram and bigram probabilities for an arbitrary text corpus. The current training corpus is
the King James Bible. The corpus is split into documents in an XML-style format; the tags are
stripped away and only the text is aggregated to build the language model. The project also
includes a random sentence generator for both unigrams and bigrams. The last part of the project
consists of implementing Good-Turing smoothing and add-one smoothing, which are later used to
compute the perplexity of a test set.
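
To make the description above concrete, here is a minimal sketch of unsmoothed and add-one
bigram estimates and of perplexity over a test sequence. It is not the project's code: the
class name BigramSketch and all method names are hypothetical, counts are kept in plain hash
maps rather than the project's Trie, and the vocabulary size is taken to be the number of
distinct training words (unseen test words are not added to it). Good-Turing smoothing is
not shown.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch only; all names are hypothetical, not the project's Bigrams.java.
class BigramSketch {
    private final Map<String, Integer> unigramCounts = new HashMap<>();
    private final Map<String, Integer> bigramCounts = new HashMap<>();

    // Count every token and every adjacent pair of tokens.
    void train(String[] tokens) {
        for (int i = 0; i < tokens.length; i++) {
            unigramCounts.merge(tokens[i], 1, Integer::sum);
            if (i > 0) {
                bigramCounts.merge(tokens[i - 1] + " " + tokens[i], 1, Integer::sum);
            }
        }
    }

    // Unsmoothed estimate: count(prev word) / count(prev); 0 if prev was never seen.
    double unsmoothed(String prev, String word) {
        int history = unigramCounts.getOrDefault(prev, 0);
        if (history == 0) {
            return 0.0;
        }
        return bigramCounts.getOrDefault(prev + " " + word, 0) / (double) history;
    }

    // Add-one estimate: (count(prev word) + 1) / (count(prev) + V),
    // where V is the number of distinct training words (an assumption of this sketch).
    double addOne(String prev, String word) {
        int vocabularySize = unigramCounts.size();
        int history = unigramCounts.getOrDefault(prev, 0);
        int pair = bigramCounts.getOrDefault(prev + " " + word, 0);
        return (pair + 1.0) / (history + vocabularySize);
    }

    // Perplexity of a token sequence under the add-one model:
    // exp of the negative average log probability of its bigrams.
    double perplexity(String[] tokens) {
        if (tokens.length < 2) {
            throw new IllegalArgumentException("need at least two tokens");
        }
        double logSum = 0.0;
        for (int i = 1; i < tokens.length; i++) {
            logSum += Math.log(addOne(tokens[i - 1], tokens[i]));
        }
        return Math.exp(-logSum / (tokens.length - 1));
    }

    public static void main(String[] args) {
        BigramSketch model = new BigramSketch();
        model.train("the lord is my shepherd the lord is good".split(" "));
        System.out.println(model.unsmoothed("the", "lord"));
        System.out.println(model.addOne("the", "lord"));
        System.out.println(model.perplexity("the lord is good".split(" ")));
    }
}
```

Summing log probabilities instead of multiplying raw probabilities avoids numeric underflow
on long test sets, which matters for a corpus the size of the King James Bible.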

Class Descriptions:
Bigrams.java - Implementation of bigrams and the related smoothing features.
INgrams.java - Interface for NGrams. Implemented by Bigrams and Unigrams.
NgramsInitializer - Starts the program by loading the file to tokenize and
populating the bigram and unigram Tries.
Tokenizer.java - Processes the corpus, strips away the tags, and formats the text for NGrams.
Trie.java - The data structure that holds the information for the NGrams language model.
The Trie implementation has been altered from a traditional character Trie to better fit
the tasks of this project.
TrieNode.java - A node class used to construct the Trie.
TrieTest.java - A JUnit test for testing Trie implementation.
Unigrams.java - Implementation of unigrams and the related smoothing features.
The classes contain a WORD_MARKER that marks the end of a word. It can be changed by the user.
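
The README does not say how the Trie differs from a character Trie. One plausible reading,
shown here purely as an assumption, is a trie whose edges are whole words, so that an n-gram
is a path from the root and its count sits on the final node. WordTrieSketch and its methods
are hypothetical names, not the project's Trie.java or TrieNode.java, and WORD_MARKER is not
modelled.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch only; not the project's Trie.java or TrieNode.java.
class WordTrieSketch {
    // Each node maps a word to the child reached by following that word.
    static class Node {
        final Map<String, Node> children = new HashMap<>();
        int count; // how many times the n-gram ending at this node was inserted
    }

    private final Node root = new Node();

    // Insert one n-gram (one or more words) and bump the count on its last node.
    void insert(String... words) {
        Node node = root;
        for (String word : words) {
            node = node.children.computeIfAbsent(word, key -> new Node());
        }
        node.count++;
    }

    // Return how many times exactly this n-gram was inserted, or 0 if never.
    int count(String... words) {
        Node node = root;
        for (String word : words) {
            node = node.children.get(word);
            if (node == null) {
                return 0;
            }
        }
        return node.count;
    }

    public static void main(String[] args) {
        WordTrieSketch trie = new WordTrieSketch();
        trie.insert("the");
        trie.insert("the", "lord");
        trie.insert("the", "lord");
        System.out.println(trie.count("the", "lord"));
    }
}
```

With this layout, all bigrams that share a first word hang off the same child of the root,
so looking up the continuations of a word is a single map access.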
 
How to run:
Run main in NgramsInitializer, passing the output file paths for the unigrams and bigrams as
arguments, then choose the file containing the corpus. (The current test set is in the project
folder as kjbible.test.) The unigram and bigram output is written to the files given as
arguments to main.
