
nlp-with-python's Introduction

NLP with Python

Scikit-Learn, NLTK, Spacy, Gensim, Textblob and more

nlp-with-python's People

Contributors

susanli2016


nlp-with-python's Issues

About doc2bow

bow_corpus = [dictionary.doc2bow(doc) for doc in processed_doc]
This line sometimes works and sometimes doesn't: sometimes I get a result, but sometimes bow_corpus comes back empty.
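A hedged note on a common cause: doc2bow silently drops any token that is not in the dictionary, so if Dictionary.filter_extremes() (or a very small corpus) removed all of a document's tokens, that document's bag-of-words comes back empty. A minimal sketch, assuming processed_doc is a list of token lists as in the notebook:

```python
# Minimal sketch: doc2bow ignores tokens missing from the dictionary,
# so an aggressive filter_extremes() can leave some documents empty.
from gensim.corpora import Dictionary

processed_doc = [['hello', 'world'], ['rare', 'tokens', 'only']]
dictionary = Dictionary(processed_doc)
# If this filter removes every token of a document, its bow will be [].
dictionary.filter_extremes(no_below=2, no_above=0.5, keep_n=100000)

bow_corpus = [dictionary.doc2bow(doc) for doc in processed_doc]
empty = [i for i, bow in enumerate(bow_corpus) if not bow]
print('documents with empty bow:', empty)
```

Relaxing the filter thresholds, or dropping the empty documents before training, usually resolves it.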

About code

for idx, topic in lda_model_tfidf.print_topics(-1):
print('Topic: {} Word: {}'.format(idx, topic))
What's the significance of -1 in the first line?
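A hedged note for this recurring question: in gensim, print_topics(num_topics, num_words) caps how many topics are returned, and passing -1 requests all topics instead of the default of 20. An equivalent, more explicit call (assuming a trained gensim LdaModel):

```python
# num_topics=-1 means "return every topic", not just the first 20.
for idx, topic in lda_model_tfidf.print_topics(num_topics=-1, num_words=10):
    print('Topic: {} Word: {}'.format(idx, topic))
```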

In Bayesian

I have almost replicated your model, and I am getting an error while computing the precision, recall, and F1 score values. I have done all kinds of debugging and still could not find the cause.
(screenshot attached)

README

The repo looks really helpful but the README needs a lot of work

Getting an Error with shape

Hello Ma'am,
I tried to replicate your work for my own project (my native language is Gujarati, so I want to build a small machine translation application). Your article on towardsdatascience.com and this GitHub notebook have been very helpful so far, but I am facing an issue when fitting the model.
Will you help me solve the issue?
Thank you.
The error is shown in the attached screenshots (199, 200, 201).

Doubt regarding data passed to a saved model

Dear sir,
This is a simple question, but I am not sure about the result. I am working on topic modeling with the LDA algorithm, which is an unsupervised learning algorithm. I have 120 documents in total, which I divided into 100 and 20. First, I passed the 100 documents to the model and stored the result; then I saved the model and the dictionary.
This saved model is then used to predict the topics of the 20 documents, and I got the result.

As we know, a saved model is used to predict unseen data.
My question is: can I pass the first 100 documents (the ones used while saving the model) together with the 20 unseen documents to the saved model? Is that possible? Will it affect the model's performance?

Predicting from model

Hello ,
That's really wonderful work. However, when I saved the model and tried to predict with it, it didn't work.
What is the best way to save it and use it for later predictions?

How to add POS_TAG feature?

Your code is very clear, thanks for that!

But how can I add more features to this model?
You only used words; what if I want to add POS_TAGS as well?
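A hedged sketch, not from the repo: one simple way to add POS information to a bag-of-words model is to tag each token with NLTK and feed "word_TAG" composites to the vectorizer alongside the plain words. The tokens_with_pos helper below is hypothetical:

```python
import nltk
from sklearn.feature_extraction.text import CountVectorizer

nltk.download('punkt', quiet=True)
nltk.download('averaged_perceptron_tagger', quiet=True)

def tokens_with_pos(text):
    # Tag each token, then keep both the word itself and a word_TAG
    # composite as separate features for the vectorizer.
    tagged = nltk.pos_tag(nltk.word_tokenize(text))
    return [w for w, t in tagged] + ['{}_{}'.format(w, t) for w, t in tagged]

# A callable analyzer bypasses CountVectorizer's own tokenization.
vectorizer = CountVectorizer(analyzer=tokens_with_pos)
X = vectorizer.fit_transform(['The quick brown fox jumps over the lazy dog.'])
print(sorted(vectorizer.vocabulary_)[:10])
```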

Error in line 67

I am getting an error on line 67, List index out of range. Please help me with it.

Looping

Hello everyone,
After solving some issues with the code, I successfully managed to extract a few CSV files of reviews.
Sometimes, however, the extraction starts looping over the 5 latest reviews and never stops, so the CSV file is never produced.
Do you have any idea how to make it stop as soon as the review IDs are duplicated? Or even just produce a .csv file without having to wait for the scrape to finish? I'm attaching my version of Susanli's code; a hedged sketch of a duplicate-ID guard follows it.


import requests
from bs4 import BeautifulSoup
import csv
import webbrowser
import io

def display(content, filename='output.html'):
    with open(filename, 'wb') as f:
        f.write(content)
        webbrowser.open(filename)


def get_soup(session, url, show=False):
    r = session.get(url)
    if show:
        display(r.content, 'temp.html')
    if r.status_code != 200: # not OK
        print('[get_soup] status code:', r.status_code)
    else:
        return BeautifulSoup(r.text, 'html.parser')

	
def post_soup(session, url, params, show=False):
    '''Read HTML from server and convert to Soup'''

    r = session.post(url, data=params)
    
    if show:
        display(r.content, 'temp.html')

    if r.status_code != 200: # not OK
        print('[post_soup] status code:', r.status_code)
    else:
        return BeautifulSoup(r.text, 'html.parser')

def scrape(url, lang='ALL'):

    # create session to keep all cookies (etc.) between requests
    session = requests.Session()

    session.headers.update({
        'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:57.0) Gecko/20100101 Firefox/57.0',
    })


    items = parse(session, url + '?filterLang=' + lang)

    return items

def parse(session, url):
    '''Get number of reviews and start getting subpages with reviews'''

    print('[parse] url:', url)

    soup = get_soup(session, url)

    if not soup:
        print('[parse] no soup:', url)
        return

    num_reviews = soup.find('span', class_='reviews_header_count').text # get text    
    num_reviews = num_reviews[1:-1] 
    num_reviews = num_reviews.replace(',', '')
    num_reviews = int(num_reviews) # convert text into integer
    print('[parse] num_reviews ALL:', num_reviews)

    url_template = url.replace('.html', '-or{}.html')
    print('[parse] url_template:', url_template)

    items = []

    offset = 0

    while True:
        subpage_url = url_template.format(offset)

        subpage_items = parse_reviews(session, subpage_url)
        if not subpage_items:
            break

        items += subpage_items

        if len(subpage_items) < 5:
            break

        offset += 5

    return items

def get_reviews_ids(soup):

    items = soup.find_all('div', attrs={'data-reviewid': True})

    if items:
        reviews_ids = [x.attrs['data-reviewid'] for x in items][::2]
        print('[get_reviews_ids] data-reviewid:', reviews_ids)
        return reviews_ids

def get_more(session, reviews_ids):

    url = 'https://www.tripadvisor.com/OverlayWidgetAjax?Mode=EXPANDED_HOTEL_REVIEWS_RESP&metaReferer=Hotel_Review'

    payload = {
        'reviews': ','.join(reviews_ids), # ie. "577882734,577547902,577300887",
        #'contextChoice': 'DETAIL_HR', # ???
        'widgetChoice': 'EXPANDED_HOTEL_REVIEW_HSX', # ???
        'haveJses': 'earlyRequireDefine,amdearly,global_error,long_lived_global,apg-Hotel_Review,apg-Hotel_Review-in,bootstrap,desktop-rooms-guests-dust-en_US,responsive-calendar-templates-dust-en_US,taevents',
        'haveCsses': 'apg-Hotel_Review-in',
        'Action': 'install',
    }

    soup = post_soup(session, url, payload)

    return soup

def parse_reviews(session, url):
    '''Get all reviews from one page'''

    print('[parse_reviews] url:', url)

    soup =  get_soup(session, url)

    if not soup:
        print('[parse_reviews] no soup:', url)
        return

    hotel_name = soup.find('h1').text

    reviews_ids = get_reviews_ids(soup)
    if not reviews_ids:
        return

    soup = get_more(session, reviews_ids)

    if not soup:
        print('[parse_reviews] no soup:', url)
        return

    items = []

    for idx, review in enumerate(soup.find_all('div', class_='reviewSelector')):

        badgets = review.find_all('span', class_='badgetext')
        if len(badgets) > 0:
            contributions = badgets[0].text
        else:
            contributions = '0'

        if len(badgets) > 1:
            helpful_vote = badgets[1].text
        else:
            helpful_vote = '0'
        user_loc = review.select_one('div.userLoc strong')
        if user_loc:
            user_loc = user_loc.text
        else:
            user_loc = ''
            
        bubble_rating = review.select_one('span.ui_bubble_rating')['class']
        bubble_rating = bubble_rating[1].split('_')[-1]

        item = {
            'review_body': review.find('p', class_='partial_entry').text,
            'review_date': review.find('span', class_='ratingDate')['title'], # 'ratingDate' instead of 'relativeDate'
        }

        items.append(item)
        print('\n--- review ---\n')
        for key,val in item.items():
            print(' ', key, ':', val)

    print()

    return items

def write_in_csv(items, filename='results.csv',
                  headers=['hotel name', 'review title', 'review body',
                           'review date', 'contributions', 'helpful vote',
                           'user name' , 'user location', 'rating'],
                  mode='w'):

    print('--- CSV ---')

    with io.open(filename, mode, encoding="utf-8") as csvfile:
        csv_file = csv.DictWriter(csvfile, headers)

        if mode == 'w':
            csv_file.writeheader()

        csv_file.writerows(items)

DB_COLUMN   = 'review_body'
DB_COLUMN1 = 'review_date'

start_urls = [
    'https://www.tripadvisor.com/Restaurant_Review-g187813-d12072302-Reviews-Eataly_Trieste-Trieste_Province_of_Trieste_Friuli_Venezia_Giulia.html',
]

headers = [ 
    DB_COLUMN, 
    DB_COLUMN1, 
]

lang = 'it'

for url in start_urls:

    # get all reviews for 'url' and 'lang'
    items = scrape(url, lang)

    if not items:
        print('No reviews')
    else:
        # write in CSV
        filename = url.split('Reviews-')[1][:-5] + '__' + lang
        print('filename:', filename)
        write_in_csv(items, filename + '.csv', headers, mode='w')
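A hedged sketch of the duplicate-ID guard asked for above. It reuses get_soup, get_reviews_ids, and parse_reviews from the code as posted; the seen_ids set and the parse_with_dedup wrapper are additions, not part of the original, and each subpage is fetched twice for simplicity:

```python
def parse_with_dedup(session, url):
    # Drop-in alternative to parse(): stop paging as soon as a subpage
    # yields only review IDs we have already collected.
    url_template = url.replace('.html', '-or{}.html')
    seen_ids = set()
    items = []
    offset = 0

    while True:
        subpage_url = url_template.format(offset)
        page_soup = get_soup(session, subpage_url)
        if not page_soup:
            break

        reviews_ids = get_reviews_ids(page_soup) or []
        new_ids = [rid for rid in reviews_ids if rid not in seen_ids]
        if not new_ids:
            # The site has started serving the same reviews again:
            # stop instead of looping forever.
            break
        seen_ids.update(new_ids)

        subpage_items = parse_reviews(session, subpage_url) or []
        items += subpage_items
        if len(subpage_items) < 5:
            break
        offset += 5

    return items
```

Calling write_in_csv(items, ...) after each page inside the loop (with mode='a' after the first write) would also get partial results onto disk without waiting for the scrape to finish.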

Consulting on missing packages

Hey susanli, I'm learning machine translation and I find your code very elegant and concise, so I wanted to run it just for learning. But in the "machine_translation.ipynb" file you import the helper and tests packages, and I cannot find their code anywhere in your project. Would you help me with this? Thank you in advance.

string object has no attribute desc

I tried the github code as explained in this article...
https://towardsdatascience.com/automatically-generate-hotel-descriptions-with-lstm-afa37002d4fc?source=rss----7f60cf5620c9---4

But got an error in the last cell:
print(generate_text("hilton seattle downtown", 100, model, max_sequence_len))
AttributeError: 'str' object has no attribute 'desc'

Every cell in this notebook works as expected except the last one:
https://github.com/susanli2016/NLP-with-Python/blob/master/Hotel%20Description%20Generation%20LSTM.ipynb

Newbie

Hello Susanli,
I tried to use your code but I received this error:

AttributeError                            Traceback (most recent call last)
<ipython-input-2-fede94e53ef8> in <module>
    211 
    212     # get all reviews for 'url' and 'lang'
--> 213     items = scrape(url, lang)
    214 
    215     if not items:

<ipython-input-2-fede94e53ef8> in scrape(url, lang)
     48 
     49 
---> 50     items = parse(session, url + '?filterLang=' + lang)
     51 
     52     return items

<ipython-input-2-fede94e53ef8> in parse(session, url)
     63         return
     64 
---> 65     num_reviews = soup.find('span', class_='reviews_header_count').text # get text
     66     num_reviews = num_reviews[1:-1]
     67     num_reviews = num_reviews.replace(',', '')

AttributeError: 'NoneType' object has no attribute 'text'

I'm a university business professor but a newbie with Python. Could you help me use your solution to scrape TripAdvisor hotel reviews? Thanks in advance.

requirements.txt

I'm attempting to follow the steps and running into issues with this line

" simple_rnn_model = simple_model(tmp_x.shape, max_french_sequence_length, english_vocab_size, french_vocab_size) "

This seems to be a version difference between our TensorFlow/Keras installations. Could you potentially create a requirements.txt that includes the version numbers of these packages?

ValueError: 'a' must be greater than 0 unless no samples are taken

I was trying to use this same concept for my dataset and got the error in the title. Could anybody help? For reference, the shape of the dataframe is (56, 11).

The error message is below:


ValueError                                Traceback (most recent call last)
in
      1 print('5 random songs with the highest positive sentiment polarity: \n')
----> 2 hps = data.loc[data['polarity'] == 0.3, ['titles']].sample().values
      3 for h in hps:
      4     print(h[0])

~/anaconda3/lib/python3.7/site-packages/pandas/core/generic.py in sample(self, n, frac, replace, weights, random_state, axis)
   5059                 )
   5060
-> 5061         locs = rs.choice(axis_length, size=n, replace=replace, p=weights)
   5062         return self.take(locs, axis=axis)
   5063

mtrand.pyx in mtrand.RandomState.choice()

ValueError: 'a' must be greater than 0 unless no samples are taken
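A hedged diagnosis (the guard below is an addition, not from the original notebook): pandas' .sample() delegates to numpy's RandomState.choice, which raises exactly this ValueError when asked to sample from zero rows, so the filter data['polarity'] == 0.3 most likely matched nothing in the (56, 11) frame. Guarding the selection shows whether that is the case:

```python
# Check for an empty selection before sampling from it.
subset = data.loc[data['polarity'] == 0.3, ['titles']]
if subset.empty:
    print('no rows with polarity == 0.3; '
          'try a range instead, e.g. data["polarity"] >= 0.3')
else:
    hps = subset.sample().values
    for h in hps:
        print(h[0])
```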

Text_classification_with_BERT.ipynb

You classified the documents based on the labels but did not use the 'Title'. Is it possible to classify text documents without even using the text?

machine translation code issue

I am getting an error after this code:

def simple_model(input_shape, output_sequence_length, english_vocab_size, french_vocab_size):
    """
    Build and train a basic RNN on x and y
    :param input_shape: Tuple of input shape
    :param output_sequence_length: Length of output sequence
    :param english_vocab_size: Number of unique English words in the dataset
    :param french_vocab_size: Number of unique French words in the dataset
    :return: Keras model built, but not trained
    """
    # TODO: Build the layers
    learning_rate = 1e-3
    input_seq = Input(input_shape[1:])
    rnn = GRU(64, return_sequences=True)(input_seq)
    logits = TimeDistributed(Dense(french_vocab_size))(rnn)
    model = Model(input_seq, Activation('softmax')(logits))
    model.compile(loss=sparse_categorical_crossentropy,
                  optimizer=Adam(learning_rate),
                  metrics=['accuracy'])

    return model

tests.test_simple_model(simple_model)

# Reshape the input to work with a basic RNN
tmp_x = pad(preproc_english_sentences, max_french_sequence_length)
tmp_x = tmp_x.reshape((-1, preproc_french_sentences.shape[-2], 1))

# Train the neural network
simple_rnn_model = simple_model(
    tmp_x.shape,
    max_french_sequence_length,
    english_vocab_size,
    french_vocab_size)
simple_rnn_model.fit(tmp_x, preproc_french_sentences, batch_size=1024, epochs=10, validation_split=0.2)

# Print prediction(s)
print(logits_to_text(simple_rnn_model.predict(tmp_x[:1])[0], french_tokenizer))

error

I am getting an assertion error about not using the sparse cross-entropy function, using TensorFlow 2.1.0 and Python 3.6.
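A hedged guess, not confirmed against this repo's tests: test_simple_model typically asserts that the compiled loss is the Keras sparse_categorical_crossentropy function object, so under TF 2.x it can fail if the loss is passed as a string or imported from standalone keras instead of tensorflow.keras. Importing everything from tensorflow.keras may resolve it:

```python
# All of these imports exist in TF 2.1; how tests.test_simple_model
# compares the loss is an assumption.
from tensorflow.keras.layers import Input, GRU, TimeDistributed, Dense, Activation
from tensorflow.keras.models import Model
from tensorflow.keras.losses import sparse_categorical_crossentropy
from tensorflow.keras.optimizers import Adam
```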

IndexError: index 0 is out of bounds for axis 0 with size 0

doc_sample = documents[documents['index'] == 4310].values[0][0]

print('original document: ')
words = []
for word in doc_sample.split(' '):
    words.append(word)
print(words)
print('\n\n tokenized and lemmatized document: ')
print(preprocess(doc_sample))

When I run these lines on my documents, it raises an IndexError. Please help:
IndexError: index 0 is out of bounds for axis 0 with size 0
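A hedged reading of this error: the filter documents[documents['index'] == 4310] matched no rows, so .values is an empty array and [0] fails. The guard below is an addition that makes the failure explicit, assuming documents has an 'index' column as in the notebook:

```python
# Check the selection before indexing into it.
sample_rows = documents[documents['index'] == 4310]
if sample_rows.empty:
    print('no row with index == 4310; available range:',
          documents['index'].min(), 'to', documents['index'].max())
else:
    doc_sample = sample_rows.values[0][0]
    print(preprocess(doc_sample))
```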

Error while training (InvalidArgumentError)

On running the code as it is on the provided data I get the following error :

python: 3.7
tf: 1.15
keras: 2.3.1

InvalidArgumentError: Incompatible shapes: [21504] vs. [1024,21]
[[{{node metrics_11/acc/Equal}}]]
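A hedged reading of the shapes, since 21504 = 1024 × 21: the labels appear to be flattened to one axis while the model output keeps (batch=1024, timesteps=21). With sparse_categorical_crossentropy the targets usually need a trailing axis of size 1, so expanding the label array before fit() is a common fix (an assumption about this notebook, not a confirmed patch):

```python
import numpy as np

# Give the labels shape (num_samples, timesteps, 1) to match the
# per-timestep logits produced by the TimeDistributed layer.
preproc_french_sentences = np.expand_dims(preproc_french_sentences, -1)
```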

Windows gives error when installing from source

I'm following the directions Spacy gives to install on Windows, with Python 3, from source (pip and conda have both given me errors that I've been unable to resolve; installing directly from source seems to get the closest to actually working). However, when I get to step 3 and enter export PYTHONPATH = pwd on the command line (with the quotes around pwd like it wants; they just mess up the formatting here), I get this error message:

'export' is not recognized as an internal or external command, operable program or batch file.
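A hedged note: `export` is a Unix shell builtin, so it does not exist in the Windows command prompt. The cmd equivalent is `set PYTHONPATH=%cd%`, and in PowerShell `$env:PYTHONPATH = (Get-Location).Path`; alternatively, running the step from Git Bash or WSL keeps the `export` syntax as written in the instructions.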

About the dataset

Hello @susanli2016. I'm really interested in the dataset you've used, specifically the corona_fake.csv dataset. I would like to know from where you got it or how you were able to scrape it. Thanks!

Corrections done and exporting to JSON

I made a few corrections to get the code working. I commented out the things I don't need but left them in for other users. This code collects User, Date, Title and Review. THERE ARE ISSUES WITH THE USER field: sometimes it appends the city/country.

I needed the code to export to JSON, so I created a list of lists:
[0, {user, date, review}], [1, {user, date, review}], [2, {user, date, review}], ...

I also corrected some cosmetic issues with the URL, which requested or{} and or{0}, and added a way out of the infinite-loop issue reported in another thread.

The initial code was VERY HELPFUL, but it didn't do what I needed and had some bugs. I hope this helps others like me who want to export reviews to JSON.

# -*- coding: utf-8 -*-
import requests
from bs4 import BeautifulSoup
import json
import webbrowser
import io

def display(content, filename='output.html'):
    with open(filename, 'wb') as f:
        f.write(content)
        webbrowser.open(filename)

def get_soup(session, url, show=False):
    r = session.get(url)
    if show:
        display(r.content, 'temp.html')
    if r.status_code != 200: # not OK
        print('[get_soup] status code:', r.status_code)
    else:
        return BeautifulSoup(r.text, 'html.parser')

def post_soup(session, url, params, show=False):
    '''Read HTML from server and convert to Soup'''
    r = session.post(url, data=params)
    if show:
        display(r.content, 'temp.html')
    if r.status_code != 200: # not OK
        print('[post_soup] status code:', r.status_code)
    else:
        return BeautifulSoup(r.text, 'html.parser')

def scrape(url, lang='ALL'):
    # create session to keep all cookies (etc.) between requests
    session = requests.Session()
    session.headers.update({
        'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:57.0) Gecko/20100101 Firefox/57.0',
    })
    items = parse(session, url + '?filterLang=' + lang)
    return items

def parse(session, url):
    '''Get number of reviews and start getting subpages with reviews'''
    print('[parse] url:', url)
    soup = get_soup(session, url)
    if not soup:
        print('[parse] no soup:', url)
        return
    num_reviews = soup.find('span', class_='reviews_header_count').text # get text
    num_reviews = num_reviews[1:-1]
    num_reviews = num_reviews.replace(',', '')
    num_reviews = int(num_reviews) # convert text into integer
    print('[parse] num_reviews ALL:', num_reviews)
    url_template = url.replace('.html', '-or{}.html')
    #print('[parse] url_template:', url_template)
    items = []
    offset = 0
    while True:
        subpage_url = url_template.format(offset)
        if (not offset) or (offset == 0):
            subpage_items = parse_reviews(session, url)
        else:
            subpage_items = parse_reviews(session, subpage_url)
        if not subpage_items:
            break
        items += subpage_items
        if len(subpage_items) < 10:
            break
        offset += 10
        if offset > num_reviews:
            break
    return items

def get_reviews_ids(soup):
    items = soup.find_all('div', attrs={'data-reviewid': True})
    if items:
        reviews_ids = [x.attrs['data-reviewid'] for x in items][::2]
        print('[get_reviews_ids] data-reviewid:', reviews_ids)
        return reviews_ids

def parse_reviews(session, url):
    '''Get all reviews from one page'''
    print('[parse_reviews] url:', url)
    soup = get_soup(session, url)
    if not soup:
        print('[parse_reviews] no soup:', url)
        return
    hotel_name = soup.find('h1').text
    reviews_ids = get_reviews_ids(soup)
    if not reviews_ids:
        return
    #soup = get_more(session, reviews_ids)
    #if not soup:
    #    print('[parse_reviews] no soup:', url)
    #    return
    items = []
    for idx, review in enumerate(soup.find_all('div', class_='reviewSelector')):
        # badgets = review.find_all('span', class_='badgetext')
        # if len(badgets) > 0:
        #     contributions = badgets[0].text
        # else:
        #     contributions = '0'
        # if len(badgets) > 1:
        #     helpful_vote = badgets[1].text
        # else:
        #     helpful_vote = '0'
        # user_loc = review.select_one('div.userLoc strong')
        # if user_loc:
        #     user_loc = user_loc.text
        # else:
        #     user_loc = ''
        # bubble_rating = review.select_one('span.ui_bubble_rating')['class']
        # bubble_rating = bubble_rating[1].split('_')[-1]
        item = {
            'review_user': review.find('div', class_='info_text').text,
            'review_date': review.find('span', class_='ratingDate').text, # 'ratingDate' instead of 'relativeDate'
            'review_title': review.find('span', class_='noQuotes').text,
            #'review_content': review.find('div', class_='entry').text,
        }
        items.append(item)
        print('\n--- Review --- \n')
        for key, val in item.items():
            print(key + ' :', val)
    print()
    return items

def write_in_json(items, filename):
    review = 0
    reviews = []
    total_reviews = len(items)
    # Create a list of lists
    while review < total_reviews:
        reviews.append((review, items[review]))
        print(reviews[review])
        review += 1
    print('--- Writing File ---')
    with open(filename, 'w') as jsonfile:
        json.dump(reviews, jsonfile, ensure_ascii=False, indent=2)
    print('--- Done ---')

start_urls = [
    'https://www.tripadvisor.es/Restaurant_Review-g187452-d12206985-Reviews-Baiba_Cafe-Oviedo_Asturias.html',
]

lang = 'es'

for url in start_urls:
    # get all reviews for 'url' and 'lang'
    items = scrape(url, lang)
    if not items:
        print('No reviews')
    else:
        # write in JSON
        filename = url.split('Reviews-')[1][:-5] + '_' + lang
        write_in_json(items, filename + '.json')

Topic Modeling for Data Preprocessing notebook produces incorrect results.

I am running through the notebook as written on GitHub, and the recommendation engine at the end gives me the same results as the beginning of the notebook (different from the results your notebook previews). You clearly had it set up correctly at some point before saving it. I have been trying to figure out what is going wrong; the data that your preprocessing drops does seem to be dropped from the set that is fed back into the recommendation engine. Tell me what additional information would help and I'll do what I can to gather it.

