kavgan / nlp-in-practice

Starter code to solve real world text data problems. Includes: Gensim Word2Vec, phrase embeddings, Text Classification with Logistic Regression, word count with pyspark, simple text preprocessing, pre-trained embeddings and more.

Home Page: http://kavita-ganesan.com/kavitas-tutorials/#.WvIizNMvyog

Jupyter Notebook 99.28% Python 0.72%

nlp natural-language-processing word2vec text-classification gensim tf-idf machine-learning text-mining

nlp-in-practice's Introduction

NLP-IN-PRACTICE

Use these NLP, Text Mining and Machine Learning code samples and tools to solve real world text data problems.

Notebooks / Source

Links in the first column take you to the subfolder/repository with the source code.

Task	Related Article	Source Type	Description
Large Scale Phrase Extraction	phrase2vec article	python script	Extract phrases for large amounts of data using PySpark. Annotate text using these phrases or use the phrases for other downstream tasks.
Word Cloud for Jupyter Notebook and Python Web Apps	word_cloud article	python script + notebook	Visualize top keywords using word counts or tfidf
Gensim Word2Vec (with dataset)	word2vec article	notebook	How to work correctly with Word2Vec to get desired results
Reading files and word count with Spark	spark article	python script	How to read files of different formats using PySpark with a word count example
Extracting Keywords with TF-IDF and SKLearn (with dataset)	tfidf article	notebook	How to extract interesting keywords from text using TF-IDF and Python's SKLEARN
Text Preprocessing	text preprocessing article	notebook	A few code snippets on how to perform text preprocessing. Includes stemming, noise removal, lemmatization and stop word removal.
TFIDFTransformer vs. TFIDFVectorizer	tfidftransformer and tfidfvectorizer usage article	notebook	How to use TFIDFTransformer and TFIDFVectorizer correctly, how the two differ, and when to use each.
Accessing Pre-trained Word Embeddings with Gensim	Pre-trained word embeddings article	notebook	How to access pre-trained GloVe and Word2Vec Embeddings using Gensim and an example of how these embeddings can be leveraged for text similarity
Text Classification in Python (with news dataset)	Text classification with Logistic Regression article	notebook	Get started with text classification. Learn how to build and evaluate a text classifier for news classification using Logistic Regression.
CountVectorizer Usage Examples	How to Correctly Use CountVectorizer? An In-Depth Look article	notebook	Learn how to maximize the use of CountVectorizer such that you are not just computing counts of words, but also preprocessing your text data appropriately as well as extracting additional features from your text dataset.
HashingVectorizer Examples	HashingVectorizer Vs. CountVectorizer article	notebook	Learn the differences between HashingVectorizer and CountVectorizer and when to use which.
CBOW vs. SkipGram	Word2Vec: A Comparison Between CBOW, SkipGram & SkipGramSI article	notebook	A quick comparison of the three embedding architectures.

Notes

For more articles, please see this list.
If you would like to receive articles via email subscribe to my mailing list.

Contact

This repository is maintained by Kavita Ganesan. Connect with me on LinkedIn or Twitter.

nlp-in-practice's Issues

Error Opening Text Preprocessing Examples.ipynb

When I try to open the notebook Text Preprocessing Examples.ipynb it gives me an error in VSCode:

Unable to open 'Text Preprocessing Examples.ipynb': Unexpected token < in JSON at position 6.

When I try to upload it in Jupyter, it says:
Cannot upload invalid Notebook
The error was: SyntaxError: JSON Parse error: Unrecognized token '<'
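Both messages complain about a `<` where JSON was expected. One possible cause (an assumption, not confirmed in this report) is that the saved file is an HTML page rather than the raw notebook. A minimal sketch for checking a local file, where `diagnose_notebook` is a hypothetical helper and the `"cells"` check assumes nbformat 4:

```python
import json
from pathlib import Path


def diagnose_notebook(path):
    """Classify a .ipynb file as 'notebook', 'html', or 'invalid'."""
    text = Path(path).read_text(encoding="utf-8", errors="replace")
    # A saved web page starts with markup, which would explain a
    # JSON parser rejecting the token '<'.
    if text.lstrip().startswith("<"):
        return "html"
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        return "invalid"
    # Assumption: nbformat 4 notebooks are JSON objects with a "cells" list.
    if isinstance(data, dict) and "cells" in data:
        return "notebook"
    return "invalid"
```

If the result is `"html"`, downloading the raw file again (or cloning the repository) is the likely remedy.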

Can you guide me on how to save a model and use it?

Thanks for the tutorial.

I am going to build an app based on this model.
Could you guide me on how to save the model and load it again?

Thanks & Best Regards.
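The issue does not say which tutorial's model is meant, so the following is only a generic sketch. It uses Python's standard `pickle` module, which can serialize most fitted Python model objects; `save_model` and `load_model` are hypothetical helper names:

```python
import pickle


def save_model(model, path):
    """Serialize a fitted model (or any picklable object) to disk."""
    with open(path, "wb") as fh:
        pickle.dump(model, fh)


def load_model(path):
    """Load a previously saved object.

    Only unpickle files you created yourself: unpickling untrusted
    data can execute arbitrary code.
    """
    with open(path, "rb") as fh:
        return pickle.load(fh)
```

When the model depends on a fitted vectorizer or transformer, saving both together (for example as a tuple) keeps the app's preprocessing consistent with training. A pickle is generally only safe to load with the same library versions that produced it.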

label_ranking_average_precision_score

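MRR is available in sklearn through the function "label_ranking_average_precision_score". When each sample has exactly one relevant label, label ranking average precision is equivalent to the mean reciprocal rank.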

https://scikit-learn.org/stable/modules/generated/sklearn.metrics.label_ranking_average_precision_score.html

The code in "Text Classification with Logistic Regression.ipynb" file will be much shorter if it is used.
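For comparison, here is a minimal pure-Python sketch of mean reciprocal rank for the single-true-label case. The function name and the sample labels are made up for illustration and are not taken from the notebook:

```python
def mean_reciprocal_rank(true_labels, ranked_predictions):
    """Average of 1/rank of the true label across samples.

    true_labels: one true label per sample.
    ranked_predictions: per sample, labels ordered from best to worst.
    A sample whose true label is missing from its ranking contributes 0.
    """
    total = 0.0
    for truth, ranking in zip(true_labels, ranked_predictions):
        if truth in ranking:
            total += 1.0 / (ranking.index(truth) + 1)
    return total / len(true_labels)


# Illustrative data: true label ranked 2nd, then 3rd.
true_labels = ["sports", "politics"]
ranked = [["politics", "sports", "tech"], ["sports", "tech", "politics"]]
score = mean_reciprocal_rank(true_labels, ranked)  # (1/2 + 1/3) / 2
```

The sklearn function takes a different input shape: a binary indicator matrix `y_true` and a score matrix `y_score`, both of shape (n_samples, n_labels).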

Cloning Repos

Hi, I am trying to clone your repo to follow along, but I am unable to do so.

binary data inside text lines of reviews_data.txt.gz of word2vec sample

There is a binary RAR file embedded among the text lines of "reviews_data.txt.gz":

Offset(h) 00 01 02 03 04 05 06 07 08 09 0A 0B 0C 0D 0E 0F

036572F0  6E 20 65 6C 20 65 71 75 69 70 61 6A 65 20 65 6E  n el equipaje en
03657300  20 65 6C 20 68 6F 74 65 6C 09 09 0D 0A 46 65 62   el hotel....Feb
03657310  20 32 31 20 32 30 30 39 20 09 74 72 E8 73 20 62   21 2009 .très b
03657320  6F 6E 20 72 61 70 70 6F 72 74 20 71 75 61 6C 69  on rapport quali
03657330  74 E9 20 70 72 69 78 09 09 0D 0A 4A 61 6E 20 34  té prix....Jan 4
03657340  20 32 30 30 39 20 09 43 61 72 69 6E 6F 20 6D 61   2009 .Carino ma
03657350  20 76 65 63 63 68 69 6F 2E 09 09 0D 0A 44 65 63   vecchio.....Dec
03657360  20 32 36 20 32 30 30 38 20 09 3F 3F 3F 3F 3F 3F   26 2008 .??????
03657370  3F 3F 3F 3F 09 09 0D 0A 4F 63 74 20 32 35 20 32  ????....Oct 25 2
03657380  30 30 38 20 09 74 72 E8 73 20 62 6F 6E 20 68 F4  008 .très bon hô
03657390  74 65 6C 09 09 0D 0A 53 65 70 20 32 33 20 32 30  tel....Sep 23 20
036573A0  30 38 20 09 65 78 63 65 6C 6C 65 6E 74 65 20 65  08 .excellente e
036573B0  78 70 E9 72 69 65 6E 63 65 09 09 0D 0A 52 61 72  xpérience....Rar
036573C0  21 1A 07 00 CF 90 73 00 00 0D 00 00 00 00 00 00  !...Ï.s.........
036573D0  00 07 2A 74 80 90 4E 00 EF 72 03 00 B2 E0 0A 00  ..*t€.N.ïr..²à..
036573E0  02 6A 2C 9E 26 17 52 83 3B 1D 33 29 00 20 00 00  .j,ž&.Rƒ;.3). ..
036573F0  00 75 73 61 5F 6E 65 76 61 64 61 5F 6C 61 73 2D  .usa_nevada_las-
03657400  76 65 67 61 73 5F 72 69 76 69 65 72 61 5F 68 6F  vegas_riviera_ho
03657410  74 65 6C 5F 63 61 73 69 6E 6F 00 B0 72 6F 91 14  tel_casino.°ro‘.
03657420  1D 51 0C CC D1 51 90 19 D9 7E CF 35 AC E8 72 AF  .Q.ÌÑQ..Ù~Ï5¬èr¯
03657430  4F 31 A5 96 49 A6 93 6E AA 9D 79 F0 E6 3F 4D 55  O1¥–I¦“nª.yðæ?MU
03657440  2B E3 A6 DD B6 A9 BF 2B E8 0A 20 A4 38 89 00 D8  +ã¦Ý¶©¿+è. ¤8‰.Ø
03657450  00 A3 46 BA 32 F3 3A 0F 3B 47 35 6D D7 A1 A4 4A  .£Fº2ó:.;G5m×¡¤J
03657460  8D EE 22 26 64 12 9F 2F CE 80 BB E7 38 DB 24 09  .î"&d.Ÿ/Î€»ç8Û$.
03657470  13 E8 89 8F 6C C0 3D 21 1F C2 6B F7 ED CE F7 20  .è‰.lÀ=!.Âk÷íÎ÷

Inside that RAR there is only one file, named "usa_nevada_las-vegas_riviera_hotel_casino", which contains some lines duplicated from the .gz file.

This causes a UnicodeDecodeError exception under Windows.

You can use open(..., errors='replace') to replace the undecodable bytes with a replacement character.
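A minimal sketch of that workaround when reading the compressed file directly. `read_review_lines` is a hypothetical helper, and the default encoding is an assumption that may need changing for this dataset:

```python
import gzip


def read_review_lines(path, encoding="utf-8"):
    """Yield text lines from a gzip file without raising on bad bytes."""
    # errors="replace" substitutes U+FFFD for every byte sequence that
    # cannot be decoded, so embedded binary data no longer raises
    # UnicodeDecodeError.
    with gzip.open(path, "rt", encoding=encoding, errors="replace") as fh:
        for line in fh:
            yield line.rstrip("\r\n")
```

The binary block still shows up as garbage lines, so downstream code may want to skip lines that contain the replacement character `"\ufffd"`.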

model name is incorrect

There is a typo in the script.
The name of the model is model_glove_twitter and not just model in this code:

for p in phrases:
    tokens_1=[t for t in p.split() if t in model.wv.vocab]
    tokens_2=[t for t in query.split() if t in model.wv.vocab]

https://github.com/kavgan/nlp-in-practice/blob/master/pre-trained-embeddings/Pre-trained%20embeddings.ipynb
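The filtering step itself can be sketched without loading any embeddings. Here `vocab` is a hypothetical stand-in set; in the notebook it would be the vocabulary of `model_glove_twitter`. Note that Gensim 4 replaced the `vocab` attribute with `key_to_index`.

```python
def in_vocabulary(text, vocab):
    """Keep only the whitespace-separated tokens present in vocab."""
    return [t for t in text.split() if t in vocab]


# Hypothetical stand-in for the embedding model's vocabulary.
vocab = {"good", "hotel", "room"}

tokens_1 = in_vocabulary("good hotel staff", vocab)
tokens_2 = in_vocabulary("clean room", vocab)
```

Tokens outside the vocabulary are dropped, so a similarity lookup is never asked about a word the model has no vector for.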

How to automate for automated prediction.

I've connected the model to my data. I get a 95% accuracy rate! It works perfectly. Now, I'm trying to use the model to iterate through the entire dataset and return the result. I'm trying to output the predictions for all 63K items.

I've tried simple for loop:

for each in df['short_description'].head(5):
    test_features=transformer.transform(each)
    get_top_k_predictions(model,each,2)

this returns:
ValueError: Iterable over raw text documents expected, string object received.

My intention is to use this as a second method of prediction to verify the results of the structured programming that I've done. As time passes, it will eventually become the primary method.

There are 63K records in the file (and growing).

Any help would be greatly appreciated.
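The ValueError says the transformer received a single string where it expects an iterable of documents, so each text has to be wrapped in a list, or a whole list of texts passed at once. A minimal sketch of the batched approach, where `predict_in_batches` is a hypothetical helper and `transform` and `predict` are whatever callables the project already has:

```python
def predict_in_batches(texts, transform, predict, batch_size=1000):
    """Run transform and predict over a list of texts, one batch at a time.

    texts: a list of raw strings.
    transform: callable mapping a list of strings to features.
    predict: callable mapping features to one prediction per document.
    """
    predictions = []
    for start in range(0, len(texts), batch_size):
        # Always pass a list of documents, never a bare string.
        batch = list(texts[start:start + batch_size])
        features = transform(batch)
        predictions.extend(predict(features))
    return predictions
```

With the names from the question, a call might look like `predict_in_batches(df['short_description'].tolist(), transformer.transform, lambda X: get_top_k_predictions(model, X, 2))`, assuming `get_top_k_predictions` accepts the transformed features rather than raw text.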

Dataset file is not a gzip file

$ tar -zxvf reviews_data.txt.gz

gzip: stdin: not in gzip format
tar: Child returned status 1
tar: Error is not recoverable: exiting now

Inspect file:

$ file reviews_data.txt.gz 
reviews_data.txt.gz: HTML document, UTF-8 Unicode text, with very long lines

$ head reviews_data.txt.gz

<!DOCTYPE html>
<html lang="en">
  <head>
    <meta charset="utf-8">

Python gives gzip error as well

/usr/lib/python3.5/gzip.py in _read_gzip_header(self)
    407 
    408         if magic != b'\037\213':
--> 409             raise OSError('Not a gzipped file (%r)' % magic)
    410 
    411         (method, flag,

OSError: Not a gzipped file (b'\n\n')
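The traceback shows Python comparing the first two bytes of the file with the gzip magic number `b'\037\213'`. The same check can be run before opening the file; `looks_like_gzip` is a hypothetical helper name:

```python
GZIP_MAGIC = b"\x1f\x8b"  # same bytes as b'\037\213' in the traceback


def looks_like_gzip(path):
    """Return True if the file starts with the gzip magic number."""
    with open(path, "rb") as fh:
        return fh.read(2) == GZIP_MAGIC
```

A `False` result for `reviews_data.txt.gz` would match the `file` output above, which suggests an HTML page was saved instead of the dataset. Separately, a plain `.gz` file that is not a tar archive is normally unpacked with `gunzip` or `gzip -d` rather than `tar`.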