
Arpine's Projects

opencv

Open Source Computer Vision Library

practical-nlp-with-nltk

Quick Hands-On NLTK tutorial for NLP in Python. NLTK is one of the most popular Python packages for Natural Language Processing (NLP). Easy to Start for Anyone.

review_object_detection_metrics

Object Detection Metrics. 14 object detection metrics: mean Average Precision (mAP), Average Recall (AR), Spatio-Temporal Tube Average Precision (STT-AP). This project supports different bounding box formats, such as those used in COCO, PASCAL, ImageNet, etc.

tcm_bert

BERT for TCM clinical records classification

text-classification-using-deep-neural-network

Natural Language Processing (NLP) is at the heart of this text classification task, so before we begin, let's cover a few terms and concepts we will be using. This will help you understand why a particular function or process is being called, or at the very least clear up any confusion you might have.

I) Stemming – Stemming is a process applied to a single word to derive its root. Many words used in a sentence are inflected or derived forms. To standardize our processing, we stem such words so that we end up with only root words. For example, a stemmer may reduce the words “walking”, “walked”, and “walker” to the root word “walk”.

II) Tokenization – Tokens are basically words. Tokenization is the process of taking a piece of text and breaking it up into its words; the output is the list of tokens in the text. For example, the sentence “Python NLP is just going great” yields the token list [“Python”, “NLP”, “is”, “just”, “going”, “great”].

III) Bag of Words – The bag-of-words model in text processing builds a unique list of words, and is used as a tool for feature generation. For example, consider two sentences: “Star Wars is better than Star Trek” and “Star Trek isn’t as good as Star Wars.” For these two sentences, the bag of words is: [“Star”, “Wars”, “Trek”, “better”, “good”, “isn’t”, “is”, “as”]. The position of each word in the list is fixed. To construct a classification feature from a sentence, we use a binary array (an array where each element is either 1 or 0). For example, a new sentence, “Wars is good”, is represented as [0,1,0,0,1,0,1,0]. Position 2 is set to 1 because the word at position 2 of the bag of words, “Wars”, is present in the sentence; the same holds for the other words “is” and “good”.
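The tokenization and bag-of-words ideas above can be sketched in a few lines of Python. This is an illustrative stand-in: a plain lower-cased split replaces nltk.word_tokenize (in the real pipeline you would also stem each token, e.g. with nltk.stem.PorterStemmer), and the function names are made up for this sketch.

```python
def tokenize(text):
    # strip surrounding punctuation and lower-case each word;
    # a simple stand-in for nltk.word_tokenize
    return [w.strip(".,!?\u2019'").lower() for w in text.split()]

def build_vocab(sentences):
    # bag of words: every unique token, each with a fixed position
    vocab = []
    for s in sentences:
        for tok in tokenize(s):
            if tok not in vocab:
                vocab.append(tok)
    return vocab

def bag_of_words(sentence, vocab):
    # binary feature array: 1 if the vocab word occurs in the sentence
    tokens = set(tokenize(sentence))
    return [1 if w in tokens else 0 for w in vocab]

vocab = build_vocab(["Star Wars is better than Star Trek",
                     "Star Trek isn't as good as Star Wars"])
print(vocab)
print(bag_of_words("Wars is good", vocab))
```

Note that this vocabulary keeps every token (including “than”), so the exact binary array differs from the hand-worked eight-word list above, but the encoding scheme is the same.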
You can read more about the bag-of-words model here.

Step 1: Data Preparation – Before we can train a model that classifies a given text into a category, we have to prepare the data. We create a simple JSON file that holds the training data. We use the 2014 India Floods dataset, which contains tweets as text along with their assigned categories. The dataset has 9 different categories relating to the natural disaster, so the JSON file has 9 categories, each with a set of sentences we can use to train the model. Given this data, we have to classify any given sentence into one of these 9 categories.

Step 2: Data Loading and Pre-processing – We create several lists: “words” holds all the unique stemmed words from all the training sentences, and “categories” holds all the distinct categories. The output of this step is a list of documents, each pairing the words of a sentence with the category the sentence belongs to. An example document is ([“whats”, “your”, “age”], “age”).

Step 3: Convert the Data to the TensorFlow Specification – From the previous step we have documents, but they are still in text form. TensorFlow, being a math library, accepts data in numeric form. So before we begin the TensorFlow text classification, we apply the bag-of-words model to convert each sentence into a numeric binary array. We store the labels/categories the same way, as numeric binary arrays.

Step 4: Initiate TensorFlow Text Classification – With the documents in the right form, we can now begin the TensorFlow text classification. In this step, we build a simple deep neural network and use it to train our model. The code runs for 100 epochs with a batch size of 20, and training took around 2 hours. The size of the data and the type of GPU heavily determine the time taken for training.
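The conversion in Steps 2–3 can be sketched as follows. The sample documents and helper names (encode, etc.) are invented for illustration; they are not taken from the project's code.

```python
# Documents of the form ([tokens], category), as produced in Step 2.
# These two samples are invented stand-ins for the real tweet data.
documents = [(["flood", "damaged", "bridge"], "infrastructure"),
             (["need", "food", "water"], "relief")]

words = sorted({w for tokens, _ in documents for w in tokens})
categories = sorted({c for _, c in documents})

def encode(tokens, category):
    # bag-of-words input vector, and the label as a one-hot binary array
    x = [1 if w in set(tokens) else 0 for w in words]
    y = [1 if c == category else 0 for c in categories]
    return x, y

train = [encode(tokens, cat) for tokens, cat in documents]
print(train[0])
```

These numeric pairs are what the network in Step 4 is trained on.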
Step 5: Testing the TensorFlow Text Classification Model – We can now test the model. It was able to correctly classify almost all of the sentences. Some sentences will still fail to be classified correctly, but only because the amount of data is small; with more data, the model will be more confident.

Conclusion – This is how you can perform TensorFlow text classification. You can use this approach and scale it to perform many different classification tasks, including building chatbots.

How users can get started with the project – NGOs, relief organisations, etc. can use this project to obtain categorised tweets, which give them information such as infrastructure damage and the number of deaths, so they can assess the current situation and make decisions accordingly.

Dataset used – the 2014 India Floods dataset.

Technologies used – Python, Information Retrieval, Natural Language Processing, Deep Learning, TensorFlow, NLTK.
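At test time the network outputs one score per category, and the predicted category is simply the argmax. A hypothetical probability vector stands in here for a real model's output; the category names are illustrative, not the dataset's actual labels.

```python
# Illustrative category names; the real dataset defines 9 categories.
categories = ["infrastructure_damage", "deaths", "relief_needs"]

# Stand-in for the model's output probabilities on one test sentence.
probs = [0.08, 0.12, 0.80]

# The predicted category is the one with the highest probability.
predicted = categories[max(range(len(probs)), key=probs.__getitem__)]
print(predicted)
```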
