nlp-text-processor's Introduction

Natural Language Processing: Text Processor

Academic Project - A Text (Pre)Processor program.

Overview:

This command-line program takes a plain text document as input, extracts several textual properties of its contents, and writes the results to a given output text file. Specifically, the output includes:

  • Number of paragraphs.
  • Number of sentences.
  • Number of words (i.e. "word tokens").
  • Number of distinct words (i.e. "word types").
  • A list of word frequency counts. Words are ordered by frequency (descending), and words with the same frequency count are ordered lexicographically (ascending).
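The ordering of the frequency list can be sketched as follows. This is only an illustration of the stated ordering rule, not the project's actual code: the class and method names (WordFrequencySketch, countTokens, sortByFrequency) are hypothetical, "lexicographical" is assumed to mean Java's String.compareTo order (so uppercase letters sort before lowercase ones), and List.of assumes Java 9 or later.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper; names are illustrative and not taken from the project.
class WordFrequencySketch {

    // Counts how often each token occurs.
    static Map<String, Integer> countTokens(List<String> tokens) {
        Map<String, Integer> counts = new HashMap<>();
        for (String token : tokens) {
            counts.merge(token, 1, Integer::sum);
        }
        return counts;
    }

    // Orders entries by count (descending), then by word (ascending).
    static List<Map.Entry<String, Integer>> sortByFrequency(Map<String, Integer> counts) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            // Assumption: ties are broken with String.compareTo.
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        return entries;
    }

    public static void main(String[] args) {
        List<String> tokens = List.of("the", "cat", "saw", "the", "dog", "and", "the", "cat");
        Map<String, Integer> counts = countTokens(tokens);
        System.out.println("Word tokens: " + tokens.size()); // 8
        System.out.println("Word types: " + counts.size());  // 5
        // Prints: the 3, cat 2, and 1, dog 1, saw 1 (one entry per line)
        for (Map.Entry<String, Integer> entry : sortByFrequency(counts)) {
            System.out.println(entry.getKey() + " " + entry.getValue());
        }
    }
}
```

The number of word tokens is simply the size of the token list, and the number of word types is the size of the count map.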

The program makes extensive use of Java's built-in string-processing and regular-expression libraries to perform text preprocessing tasks such as sentence segmentation and other string-processing steps.
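As a rough illustration of regex-based segmentation, the sketch below counts paragraphs and sentences with java.util.regex. The rules are assumptions made for this example only, since this README does not spell out the program's own rules: paragraphs are taken to be separated by blank lines, and a sentence is taken to end at a run of ".", "!" or "?" (optionally followed by closing quotes or brackets) that is followed by whitespace or the end of the text. The class name SegmentationSketch is hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch; the patterns are assumptions, not the project's rules.
class SegmentationSketch {

    // Assumption: paragraphs are separated by one or more blank lines.
    private static final Pattern PARAGRAPH_BREAK = Pattern.compile("\\R\\s*\\R");

    // Assumption: sentence-final punctuation is followed by whitespace or end of text.
    private static final Pattern SENTENCE_END =
            Pattern.compile("[.!?]+[\"')\\]]*(?=\\s|$)");

    static int countParagraphs(String text) {
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return 0;
        }
        return PARAGRAPH_BREAK.split(trimmed).length;
    }

    static int countSentences(String text) {
        Matcher matcher = SENTENCE_END.matcher(text);
        int count = 0;
        while (matcher.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String text = "First paragraph. It has two sentences!\n\n"
                + "Second paragraph costs $3.19. Is that all?";
        System.out.println("Paragraphs: " + countParagraphs(text)); // 2
        System.out.println("Sentences: " + countSentences(text));   // 4
    }
}
```

The lookahead keeps the period inside "$3.19" from being counted as a sentence end. A rule this simple would still miscount abbreviations such as "Dr." followed by a space, so a real segmenter needs extra handling for those cases.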

Words were tokenized such that:

  • Leading and trailing punctuation marks are separated into individual tokens. For example, "(3A):" is made into four tokens ("(", "3A", ")" and ":"), and "$3.19" is made into two tokens ("$" and "3.19").

  • Contractions are separated into individual tokens (without expanding to true words). Although some contractions are ambiguous (e.g. "they'd" could be "they would" or "they had"), a number of simple rules were applied during development of the program:

    • words ending with n't -- e.g. "don't" -> "do" and "not"

    • words ending with 'll -- assume "will"; e.g. "they'll" -> "they" and "will"

    • words ending with 've -- assume "have"; e.g. "they've" -> "they" and "have"

    • words ending with 'd -- assume "would"; e.g. "they'd" -> "they" and "would"

    • words ending with 're -- assume "are"; e.g. "they're" -> "they" and "are"

    • words ending with 's -- assume "is" if the preceding word is a personal pronoun; e.g. "it's" -> "it" and "is". Personal pronouns (in the subject/nominative case) that apply here are "he", "she" and "it".

    • words ending with 's -- assume possessive (i.e., an apostrophe-s) if the preceding word is not a personal pronoun; e.g. "phone's" -> "phone" and "'s"

    • special case: "I'm" (or "i'm") -- assume "am"

    • For any contractions encountered, only one contraction is separated. If a word contains multiple contractions, only the last/right-most one is separated.
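The tokenization rules above can be sketched in Java roughly as follows. This is a minimal illustration under stated assumptions, not the program's actual implementation: the class and method names (TokenizerSketch, splitEdgePunctuation, splitContraction, tokenize) are hypothetical, "punctuation" is approximated as any character that is not a letter or digit, apostrophes are assumed to be the plain ASCII character, and List.of and Set.of assume Java 9 or later.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch; names and details are illustrative only.
class TokenizerSketch {

    // Pronouns for which a trailing 's is read as "is".
    private static final Set<String> IS_PRONOUNS = Set.of("he", "she", "it");

    // Suffix rules checked in order; the word's right-most contraction wins.
    private static final String[][] SUFFIX_RULES = {
        {"n't", "not"}, {"'ll", "will"}, {"'ve", "have"}, {"'re", "are"}, {"'d", "would"}
    };

    // Splits leading and trailing non-alphanumeric characters into single tokens.
    static List<String> splitEdgePunctuation(String chunk) {
        int start = 0;
        int end = chunk.length();
        while (start < end && !Character.isLetterOrDigit(chunk.charAt(start))) start++;
        while (end > start && !Character.isLetterOrDigit(chunk.charAt(end - 1))) end--;
        List<String> tokens = new ArrayList<>();
        for (int i = 0; i < start; i++) tokens.add(String.valueOf(chunk.charAt(i)));
        if (start < end) tokens.add(chunk.substring(start, end));
        for (int i = end; i < chunk.length(); i++) tokens.add(String.valueOf(chunk.charAt(i)));
        return tokens;
    }

    // Separates at most one contraction, taken from the end of the word.
    static List<String> splitContraction(String word) {
        String lower = word.toLowerCase(Locale.ROOT);
        if (lower.equals("i'm")) {
            return List.of(word.substring(0, 1), "am");
        }
        for (String[] rule : SUFFIX_RULES) {
            String suffix = rule[0];
            if (lower.length() > suffix.length() && lower.endsWith(suffix)) {
                return List.of(word.substring(0, word.length() - suffix.length()), rule[1]);
            }
        }
        if (lower.length() > 2 && lower.endsWith("'s")) {
            String stem = word.substring(0, word.length() - 2);
            boolean pronoun = IS_PRONOUNS.contains(stem.toLowerCase(Locale.ROOT));
            return List.of(stem, pronoun ? "is" : "'s");
        }
        return List.of(word);
    }

    // Splits on whitespace, then applies both steps to every chunk.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String chunk : text.trim().split("\\s+")) {
            if (chunk.isEmpty()) continue;
            for (String piece : splitEdgePunctuation(chunk)) {
                tokens.addAll(splitContraction(piece));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Prints: [(, 3A, ), :, it, is, $, 3.19, ,, and, they'd, have, paid, .]
        System.out.println(tokenize("(3A): it's $3.19, and they'd've paid."));
    }
}
```

Because each rule only inspects the end of the word, a word with several contractions such as "they'd've" yields "they'd" and "have", which matches the right-most-only rule. Applying the n't rule literally also turns "can't" into "ca" and "not"; the sketch keeps that behaviour rather than guessing at special cases.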

Screenshots:

Program execution: the data-small.txt file on the terminal


Program execution: the data-medium.txt file on the terminal


Program execution: the HG-heldout-utf8.txt file on the terminal


Usage:

The build folder in this repository contains all the executable files needed to run the program. Download the folder to your desktop first.

Open a terminal (or command-line shell) and navigate to the build directory, i.e. '/build/'. The format for running the program is:

>> java TextPreProcessorMain "inputFile" "outputFile"

where 'inputFile' is the name of the .txt file containing the text data to be preprocessed, and 'outputFile' is the name of the .txt file in which the results of the preprocessing should be stored.

For example, while in the 'build' directory:

This command will run the program on the large-size input file 'HG-heldout-utf8.txt' and store the output in the 'output-HG-heldout-utf8.txt' file.

>> java TextPreProcessorMain "HG-heldout-utf8.txt" "output-HG-heldout-utf8.txt"

This command will run the program on the small-size input file 'data-small.txt' and store the output in the 'myOutput.txt' file.

>> java TextPreProcessorMain "data-small.txt" "myOutput.txt"

This command will run the program on the medium-size input file 'data-medium.txt' and store the output in the 'myOutput.txt' file.

>> java TextPreProcessorMain "data-medium.txt" "myOutput.txt"


Running Demo:

[Demo image]

