
kavgan / nlp-in-practice


Starter code to solve real world text data problems. Includes: Gensim Word2Vec, phrase embeddings, Text Classification with Logistic Regression, word count with pyspark, simple text preprocessing, pre-trained embeddings and more.

Home Page: http://kavita-ganesan.com/kavitas-tutorials/#.WvIizNMvyog

Jupyter Notebook 99.28% Python 0.72%
nlp natural-language-processing word2vec text-classification gensim tf-idf machine-learning text-mining

nlp-in-practice's Introduction

NLP-IN-PRACTICE

Use these NLP, Text Mining and Machine Learning code samples and tools to solve real world text data problems.

Notebooks / Source

Links in the first column take you to the subfolder/repository with the source code.

| Task | Related Article | Source Type | Description |
| --- | --- | --- | --- |
| Large Scale Phrase Extraction | phrase2vec article | python script | Extract phrases for large amounts of data using PySpark. Annotate text using these phrases or use the phrases for other downstream tasks. |
| Word Cloud for Jupyter Notebook and Python Web Apps | word_cloud article | python script + notebook | Visualize top keywords using word counts or TF-IDF. |
| Gensim Word2Vec (with dataset) | word2vec article | notebook | How to work correctly with Word2Vec to get desired results. |
| Reading files and word count with Spark | spark article | python script | How to read files of different formats using PySpark, with a word count example. |
| Extracting Keywords with TF-IDF and SKLearn (with dataset) | tfidf article | notebook | How to extract interesting keywords from text using TF-IDF and Python's SKLEARN. |
| Text Preprocessing | text preprocessing article | notebook | A few code snippets on how to perform text preprocessing. Includes stemming, noise removal, lemmatization and stop word removal. |
| TFIDFTransformer vs. TFIDFVectorizer | tfidftransformer and tfidfvectorizer usage article | notebook | How to use TFIDFTransformer and TFIDFVectorizer correctly, the difference between the two, and when to use which. |
| Accessing Pre-trained Word Embeddings with Gensim | Pre-trained word embeddings article | notebook | How to access pre-trained GloVe and Word2Vec embeddings using Gensim, with an example of how these embeddings can be leveraged for text similarity. |
| Text Classification in Python (with news dataset) | Text classification with Logistic Regression article | notebook | Get started with text classification. Learn how to build and evaluate a text classifier for news classification using Logistic Regression. |
| CountVectorizer Usage Examples | How to Correctly Use CountVectorizer? An In-Depth Look article | notebook | Learn how to maximize the use of CountVectorizer so that you are not just computing word counts, but also preprocessing your text data appropriately and extracting additional features from your text dataset. |
| HashingVectorizer Examples | HashingVectorizer Vs. CountVectorizer article | notebook | Learn the differences between HashingVectorizer and CountVectorizer and when to use which. |
| CBOW vs. SkipGram | Word2Vec: A Comparison Between CBOW, SkipGram & SkipGramSI article | notebook | A quick comparison of the three embedding architectures. |

Notes

Contact

This repository is maintained by Kavita Ganesan. Connect with me on LinkedIn or Twitter.

nlp-in-practice's People

Contributors

brusic, kavgan

nlp-in-practice's Issues

Cloning Repos

Hi, I am trying to clone your repos to follow along, but I am unable to do so.

binary data inside text lines of reviews_data.txt.gz of word2vec sample

There is a binary RAR file embedded in the middle of the text lines of "reviews_data.txt.gz":

Offset(h) 00 01 02 03 04 05 06 07 08 09 0A 0B 0C 0D 0E 0F

036572F0  6E 20 65 6C 20 65 71 75 69 70 61 6A 65 20 65 6E  n el equipaje en
03657300  20 65 6C 20 68 6F 74 65 6C 09 09 0D 0A 46 65 62   el hotel....Feb
03657310  20 32 31 20 32 30 30 39 20 09 74 72 E8 73 20 62   21 2009 .très b
03657320  6F 6E 20 72 61 70 70 6F 72 74 20 71 75 61 6C 69  on rapport quali
03657330  74 E9 20 70 72 69 78 09 09 0D 0A 4A 61 6E 20 34  té prix....Jan 4
03657340  20 32 30 30 39 20 09 43 61 72 69 6E 6F 20 6D 61   2009 .Carino ma
03657350  20 76 65 63 63 68 69 6F 2E 09 09 0D 0A 44 65 63   vecchio.....Dec
03657360  20 32 36 20 32 30 30 38 20 09 3F 3F 3F 3F 3F 3F   26 2008 .??????
03657370  3F 3F 3F 3F 09 09 0D 0A 4F 63 74 20 32 35 20 32  ????....Oct 25 2
03657380  30 30 38 20 09 74 72 E8 73 20 62 6F 6E 20 68 F4  008 .très bon hô
03657390  74 65 6C 09 09 0D 0A 53 65 70 20 32 33 20 32 30  tel....Sep 23 20
036573A0  30 38 20 09 65 78 63 65 6C 6C 65 6E 74 65 20 65  08 .excellente e
036573B0  78 70 E9 72 69 65 6E 63 65 09 09 0D 0A 52 61 72  xpérience....Rar
036573C0  21 1A 07 00 CF 90 73 00 00 0D 00 00 00 00 00 00  !...Ï.s.........
036573D0  00 07 2A 74 80 90 4E 00 EF 72 03 00 B2 E0 0A 00  ..*t€.N.ïr..²à..
036573E0  02 6A 2C 9E 26 17 52 83 3B 1D 33 29 00 20 00 00  .j,ž&.Rƒ;.3). ..
036573F0  00 75 73 61 5F 6E 65 76 61 64 61 5F 6C 61 73 2D  .usa_nevada_las-
03657400  76 65 67 61 73 5F 72 69 76 69 65 72 61 5F 68 6F  vegas_riviera_ho
03657410  74 65 6C 5F 63 61 73 69 6E 6F 00 B0 72 6F 91 14  tel_casino.°ro‘.
03657420  1D 51 0C CC D1 51 90 19 D9 7E CF 35 AC E8 72 AF  .Q.ÌÑQ..Ù~Ï5¬èr¯
03657430  4F 31 A5 96 49 A6 93 6E AA 9D 79 F0 E6 3F 4D 55  O1¥–I¦“nª.yðæ?MU
03657440  2B E3 A6 DD B6 A9 BF 2B E8 0A 20 A4 38 89 00 D8  +ã¦Ý¶©¿+è. ¤8‰.Ø
03657450  00 A3 46 BA 32 F3 3A 0F 3B 47 35 6D D7 A1 A4 4A  .£Fº2ó:.;G5mס¤J
03657460  8D EE 22 26 64 12 9F 2F CE 80 BB E7 38 DB 24 09  .î"&d.Ÿ/΀»ç8Û$.
03657470  13 E8 89 8F 6C C0 3D 21 1F C2 6B F7 ED CE F7 20  .è‰.lÀ=!.Âk÷íÎ÷ 

That RAR contains only one file, named "usa_nevada_las-vegas_riviera_hotel_casino", which holds some lines duplicated from the .gz file.

This causes a UnicodeDecodeError exception under Windows.

As a workaround, you can use open(..., errors='replace') so that bytes that cannot be decoded are substituted with the replacement character (U+FFFD, usually rendered as a question-mark glyph).
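The workaround can be sketched roughly as follows. This is a minimal illustration, not the repository's own loader: the helper names `find_rar_offset` and `read_lines_lenient` are made up here, and the signature constant is the RAR 4.x magic (`52 61 72 21 1A 07 00`) that appears in the hex dump at offset 0x036573BD.

```python
import gzip

# RAR 4.x signature, as seen in the hex dump above ("Rar!" 1A 07 00).
RAR4_MAGIC = b"Rar!\x1a\x07\x00"


def find_rar_offset(path):
    """Return the offset of the first RAR signature in the decompressed
    stream, or -1 if none is found. (Hypothetical helper.)"""
    with gzip.open(path, "rb") as f:
        data = f.read()  # reads everything into memory; fine for a one-off check
    return data.find(RAR4_MAGIC)


def read_lines_lenient(path, encoding="utf-8"):
    """Yield text lines, replacing undecodable bytes with U+FFFD instead of
    raising UnicodeDecodeError. (Hypothetical helper; the encoding is an
    assumption, the dump suggests the reviews may be Latin-1.)"""
    with gzip.open(path, "rt", encoding=encoding, errors="replace") as f:
        for line in f:
            yield line
```

Passing an explicit `encoding` also avoids depending on the platform default, which differs between Windows and Linux. Note that `errors="replace"` only hides the problem: the embedded archive bytes still end up in the corpus as junk lines, so truncating the data at the offset reported by `find_rar_offset` may be the cleaner option.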

How to automate for automated prediction.

I've connected the model to my data. I get a 95% accuracy rate! It works perfectly. Now, I'm trying to use the model to iterate through the entire dataset and return the result. I'm trying to output the predictions for all 63K items.

I've tried a simple for loop:

for each in df['short_description'].head(5):
    test_features=transformer.transform(each)
    get_top_k_predictions(model,each,2)

this returns:
ValueError: Iterable over raw text documents expected, string object received.

My intention is to use this as a second method of prediction to verify the results of the structured programming that I've done. As time passes, it will eventually become the primary method.

There are 63K records in the file (and growing).

Any help would be greatly appreciated.
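The ValueError is raised because scikit-learn vectorizers expect an iterable of documents, not a bare string. One possible fix is to wrap a single text in a list, or better, to transform the whole column in one call. The sketch below is an assumption-laden illustration: the toy training data, the `TfidfVectorizer` / `LogisticRegression` pairing, and the body of `get_top_k_predictions` are stand-ins for whatever the issue author actually has, reusing only the names from the issue.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression


def get_top_k_predictions(model, X, k):
    """Hypothetical helper: for each row of the feature matrix X, return
    the k most probable class labels, most probable first."""
    probs = model.predict_proba(X)
    top = np.argsort(probs, axis=1)[:, -k:][:, ::-1]
    return [[model.classes_[i] for i in row] for row in top]


# Toy stand-ins for the already fitted vectorizer and model from the issue.
train_texts = ["stock market falls", "shares and stocks rally",
               "team wins the final", "coach praises the team"]
train_labels = ["business", "business", "sports", "sports"]
transformer = TfidfVectorizer()
model = LogisticRegression().fit(transformer.fit_transform(train_texts), train_labels)

# With a DataFrame this would be df['short_description'].tolist().
descriptions = ["stocks rally again", "the team wins"]

# One document at a time: wrap the string in a list.
single = get_top_k_predictions(model, transformer.transform([descriptions[0]]), 2)

# All documents at once: a single transform and a single predict_proba call.
all_preds = get_top_k_predictions(model, transformer.transform(descriptions), 2)
```

Note that the helper receives the transformed features, not the raw text. Transforming all 63K rows in one call is typically much faster than looping, and the resulting list can be assigned back to a new DataFrame column.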

Dataset file is not a gzip file

$ tar -zxvf reviews_data.txt.gz

gzip: stdin: not in gzip format
tar: Child returned status 1
tar: Error is not recoverable: exiting now

Inspect file:

$ file reviews_data.txt.gz 
reviews_data.txt.gz: HTML document, UTF-8 Unicode text, with very long lines

$ head reviews_data.txt.gz

<!DOCTYPE html>
<html lang="en">
  <head>
    <meta charset="utf-8">

Python gives gzip error as well

/usr/lib/python3.5/gzip.py in _read_gzip_header(self)
    407 
    408         if magic != b'\037\213':
--> 409             raise OSError('Not a gzipped file (%r)' % magic)
    410 
    411         (method, flag,

OSError: Not a gzipped file (b'\n\n')
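The `file` output and the traceback both suggest that what was saved is an HTML page rather than the archive itself. A quick way to confirm this before trying to decompress is to look at the first bytes of the file. The sketch below is only an illustration; `sniff_download` is a hypothetical helper name, and the check relies on the gzip magic bytes `1f 8b` (the same `b'\037\213'` that `gzip.py` tests for in the traceback).

```python
GZIP_MAGIC = b"\x1f\x8b"  # same value as b'\037\213' in the traceback


def sniff_download(path):
    """Classify a downloaded file as 'gzip', 'html', or 'unknown' by
    inspecting its first bytes. (Hypothetical helper.)"""
    with open(path, "rb") as f:
        head = f.read(512)
    if head.startswith(GZIP_MAGIC):
        return "gzip"
    # Skip leading whitespace: the file in this report starts with blank lines.
    stripped = head.lstrip().lower()
    if stripped.startswith(b"<!doctype html") or stripped.startswith(b"<html"):
        return "html"
    return "unknown"
```

If this returns `"html"`, the file most likely needs to be downloaded again from a raw-file link or obtained by cloning the repository. Separately, `reviews_data.txt.gz` is described as a plain gzip file rather than a tar archive, so `gunzip` or Python's `gzip.open` would be the expected tools rather than `tar -zxvf`.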
