susanli2016 / nlp-with-python
Scikit-Learn, NLTK, Spacy, Gensim, Textblob and more
I am running the notebook as written on GitHub, and the recommendation engine at the end of the notebook gives me the same results as at the beginning of the notebook (different from the results your notebook previews). You clearly had it set up correctly at some point before saving it. I have been trying to figure out what is going wrong, but the data that your preprocessing drops does seem to be dropped from the set that is fed back into the recommendation engine. Let me know what further information would help and I'll do what I can.
for idx, topic in lda_model_tfidf.print_topics(-1):
    print('Topic: {} Word: {}'.format(idx, topic))
What's the significance of -1 in the first line?
Hello, I would like to know where and when this paper was published, because I want to cite it as a reference.
Hello everyone,
After solving some issues with the code, I successfully managed to extract a few CSV files of reviews.
Sometimes, however, the extraction starts looping on the 5 latest reviews and never stops, so the CSV file is never written.
Do you have any idea how to stop as soon as the review ids are duplicated? Or even just how to produce a .csv file without having to wait for the run to finish? I'm attaching my version of Susanli's code.
import requests
from bs4 import BeautifulSoup
import csv
import webbrowser
import io

def display(content, filename='output.html'):
    with open(filename, 'wb') as f:
        f.write(content)
    webbrowser.open(filename)

def get_soup(session, url, show=False):
    r = session.get(url)
    if show:
        display(r.content, 'temp.html')
    if r.status_code != 200: # not OK
        print('[get_soup] status code:', r.status_code)
    else:
        return BeautifulSoup(r.text, 'html.parser')

def post_soup(session, url, params, show=False):
    '''Read HTML from server and convert to Soup'''
    r = session.post(url, data=params)
    if show:
        display(r.content, 'temp.html')
    if r.status_code != 200: # not OK
        print('[post_soup] status code:', r.status_code)
    else:
        return BeautifulSoup(r.text, 'html.parser')

def scrape(url, lang='ALL'):
    # create session to keep all cookies (etc.) between requests
    session = requests.Session()
    session.headers.update({
        'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:57.0) Gecko/20100101 Firefox/57.0',
    })
    items = parse(session, url + '?filterLang=' + lang)
    return items

def parse(session, url):
    '''Get number of reviews and start getting subpages with reviews'''
    print('[parse] url:', url)
    soup = get_soup(session, url)
    if not soup:
        print('[parse] no soup:', url)
        return
    num_reviews = soup.find('span', class_='reviews_header_count').text # get text
    num_reviews = num_reviews[1:-1]
    num_reviews = num_reviews.replace(',', '')
    num_reviews = int(num_reviews) # convert text into integer
    print('[parse] num_reviews ALL:', num_reviews)
    url_template = url.replace('.html', '-or{}.html')
    print('[parse] url_template:', url_template)
    items = []
    offset = 0
    while(True):
        subpage_url = url_template.format(offset)
        subpage_items = parse_reviews(session, subpage_url)
        if not subpage_items:
            break
        items += subpage_items
        if len(subpage_items) < 5:
            break
        offset += 5
    return items

def get_reviews_ids(soup):
    items = soup.find_all('div', attrs={'data-reviewid': True})
    if items:
        reviews_ids = [x.attrs['data-reviewid'] for x in items][::2]
        print('[get_reviews_ids] data-reviewid:', reviews_ids)
        return reviews_ids

def get_more(session, reviews_ids):
    url = 'https://www.tripadvisor.com/OverlayWidgetAjax?Mode=EXPANDED_HOTEL_REVIEWS_RESP&metaReferer=Hotel_Review'
    payload = {
        'reviews': ','.join(reviews_ids), # ie. "577882734,577547902,577300887",
        #'contextChoice': 'DETAIL_HR', # ???
        'widgetChoice': 'EXPANDED_HOTEL_REVIEW_HSX', # ???
        'haveJses': 'earlyRequireDefine,amdearly,global_error,long_lived_global,apg-Hotel_Review,apg-Hotel_Review-in,bootstrap,desktop-rooms-guests-dust-en_US,responsive-calendar-templates-dust-en_US,taevents',
        'haveCsses': 'apg-Hotel_Review-in',
        'Action': 'install',
    }
    soup = post_soup(session, url, payload)
    return soup

def parse_reviews(session, url):
    '''Get all reviews from one page'''
    print('[parse_reviews] url:', url)
    soup = get_soup(session, url)
    if not soup:
        print('[parse_reviews] no soup:', url)
        return
    hotel_name = soup.find('h1').text
    reviews_ids = get_reviews_ids(soup)
    if not reviews_ids:
        return
    soup = get_more(session, reviews_ids)
    if not soup:
        print('[parse_reviews] no soup:', url)
        return
    items = []
    for idx, review in enumerate(soup.find_all('div', class_='reviewSelector')):
        badgets = review.find_all('span', class_='badgetext')
        if len(badgets) > 0:
            contributions = badgets[0].text
        else:
            contributions = '0'
        if len(badgets) > 1:
            helpful_vote = badgets[1].text
        else:
            helpful_vote = '0'
        user_loc = review.select_one('div.userLoc strong')
        if user_loc:
            user_loc = user_loc.text
        else:
            user_loc = ''
        bubble_rating = review.select_one('span.ui_bubble_rating')['class']
        bubble_rating = bubble_rating[1].split('_')[-1]
        item = {
            'review_body': review.find('p', class_='partial_entry').text,
            'review_date': review.find('span', class_='ratingDate')['title'], # 'ratingDate' instead of 'relativeDate'
        }
        items.append(item)
        print('\n--- review ---\n')
        for key,val in item.items():
            print(' ', key, ':', val)
        print()
    return items

def write_in_csv(items, filename='results.csv',
                 headers=['hotel name', 'review title', 'review body',
                          'review date', 'contributions', 'helpful vote',
                          'user name' , 'user location', 'rating'],
                 mode='w'):
    print('--- CSV ---')
    with io.open(filename, mode, encoding="utf-8") as csvfile:
        csv_file = csv.DictWriter(csvfile, headers)
        if mode == 'w':
            csv_file.writeheader()
        csv_file.writerows(items)

DB_COLUMN = 'review_body'
DB_COLUMN1 = 'review_date'
start_urls = [
    'https://www.tripadvisor.com/Restaurant_Review-g187813-d12072302-Reviews-Eataly_Trieste-Trieste_Province_of_Trieste_Friuli_Venezia_Giulia.html',
]
headers = [
    DB_COLUMN,
    DB_COLUMN1,
]
lang = 'it'
for url in start_urls:
    # get all reviews for 'url' and 'lang'
    items = scrape(url, lang)
    if not items:
        print('No reviews')
    else:
        # write in CSV
        filename = url.split('Reviews-')[1][:-5] + '__' + lang
        print('filename:', filename)
        write_in_csv(items, filename + '.csv', headers, mode='w')
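One possible way to stop the looping described above is to remember the review ids already collected and break as soon as a page contributes no new ones. The sketch below is not part of the original script: `collect_pages`, `fetch_page`, and `max_pages` are hypothetical names, and `fetch_page(offset)` stands in for whatever returns the `(review_id, item)` pairs of one subpage.

```python
def collect_pages(fetch_page, page_size=5, max_pages=1000):
    """Collect items page by page, stopping once a page repeats known ids.

    fetch_page(offset) is a hypothetical callable returning a list of
    (review_id, item) pairs for the subpage starting at `offset`.
    """
    seen_ids = set()
    items = []
    offset = 0
    for _ in range(max_pages):  # hard upper bound as a second safety net
        page = fetch_page(offset)
        if not page:
            break
        added = 0
        for review_id, item in page:
            if review_id in seen_ids:
                continue  # already collected on an earlier page
            seen_ids.add(review_id)
            items.append(item)
            added += 1
        if added == 0 or len(page) < page_size:
            break  # only duplicates, or a short (last) page
        offset += page_size
    return items
```

Because the function always returns whatever it has gathered so far, the CSV can be written from the returned list even when the site keeps serving the same last page.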
bow_corpus = [dictionary.doc2bow(doc) for doc in processed_doc]
This line sometimes works and sometimes does not.
Sometimes I get a result, but sometimes it returns an empty list:
bow_corpus is empty
Hello Susanli,
I tried to use your code but I received this error:
AttributeError Traceback (most recent call last)
<ipython-input-2-fede94e53ef8> in <module>
211
212 # get all reviews for 'url' and 'lang'
--> 213 items = scrape(url, lang)
214
215 if not items:
<ipython-input-2-fede94e53ef8> in scrape(url, lang)
48
49
---> 50 items = parse(session, url + '?filterLang=' + lang)
51
52 return items
<ipython-input-2-fede94e53ef8> in parse(session, url)
63 return
64
---> 65 num_reviews = soup.find('span', class_='reviews_header_count').text # get text
66 num_reviews = num_reviews[1:-1]
67 num_reviews = num_reviews.replace(',', '')
AttributeError: 'NoneType' object has no attribute 'text'
I'm a university business professor but a newbie with Python. Could you help me use your solution to scrape TripAdvisor hotel reviews? Thanks in advance.
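The traceback shows that `soup.find('span', class_='reviews_header_count')` returned `None`, so the fetched page contained no such element (possibly because the page markup differs from what the script expects; that is an assumption). A minimal sketch of a guard follows, using the hypothetical helpers `text_or_none` and `parse_review_count`, which are not in the original code:

```python
def text_or_none(node):
    """Return node.text, or None when the lookup found nothing."""
    return node.text if node is not None else None


def parse_review_count(raw):
    """Turn a string such as '(1,234)' into 1234; return None if no digits."""
    if raw is None:
        return None
    digits = ''.join(ch for ch in raw if ch.isdigit())
    return int(digits) if digits else None
```

In `parse`, the result of `soup.find(...)` would be passed through `text_or_none` first, and a `None` count treated as "count unknown" rather than crashing.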
Your code is very clear, THANKS FOR THAT!
But how can I add more features to this model?
You only used words, but what if I also want to add POS tags?
Hello @susanli2016. I'm really interested in the dataset you've used, specifically the corona_fake.csv dataset. I would like to know where you got it or how you were able to scrape it. Thanks!
I am getting an error on line 67, List index out of range. Please help me with it.
The error is as follows
AttributeError: 'module' object has no attribute 'tslib'
On running the code as it is on the provided data I get the following error:
python: 3.7
tf: 1.15
keras: 2.3.1
InvalidArgumentError: Incompatible shapes: [21504] vs. [1024,21]
[[{{node metrics_11/acc/Equal}}]]
Dear sir,
This is a simple question, but I am not sure about the result. I am working on a topic model using the LDA algorithm, which is an unsupervised learning algorithm. I have 120 documents in total, which I split into 100 and 20. First, I passed the 100 documents to the model and stored the result. Then I saved the model and the dictionary.
I then used this saved model to predict the topics of the 20 documents and got the result.
As we know, a saved model is used to predict unseen data.
My question is: can I pass the first 100 documents (the ones used to train the saved model) together with the 20 unseen documents to the saved model? Is that possible? Will it affect the model's performance?
I'm following the directions Spacy gives to install for Windows, Python 3, and from source (pip and conda have both given me errors that I've still been unable to resolve, directly from source seems to get the closest to actually installing). However, when I get to step 3 and enter export PYTHONPATH = pwd in the command line (with the quotes around pwd like it wants, it just messes up the formatting here), I get this error message:
export is not recognized as an internal or external command, operable program, or batch file.
Hi, I got this problem:
The numbers of items and labels differ: |x| = 18260, |y| = 1
When I try my own Indonesian dataset, I have:
X_train = (8675, 18260)
y_train = 8675
Can you help me, please?
I tried the github code as explained in this article...
https://towardsdatascience.com/automatically-generate-hotel-descriptions-with-lstm-afa37002d4fc?source=rss----7f60cf5620c9---4
But got an error in the last cell:
print(generate_text("hilton seattle downtown", 100, model, max_sequence_len))
AttributeError: 'str' object has no attribute 'desc'
All lines on this page except the last line work as expected:
https://github.com/susanli2016/NLP-with-Python/blob/master/Hotel%20Description%20Generation%20LSTM.ipynb
I was trying to use this same concept for my dataset and got the error in the title. Could anybody help? I also checked the shape of the dataframe which was (56, 11).
The error message is below:
ValueError Traceback (most recent call last)
in
1 print('5 random songs with the highest positive sentiment polarity: \n')
----> 2 hps = data.loc[data['polarity'] == 0.3, ['titles']].sample().values
3 for h in hps:
4 print(h[0])
~/anaconda3/lib/python3.7/site-packages/pandas/core/generic.py in sample(self, n, frac, replace, weights, random_state, axis)
5059 )
5060
-> 5061 locs = rs.choice(axis_length, size=n, replace=replace, p=weights)
5062 return self.take(locs, axis=axis)
5063
mtrand.pyx in mtrand.RandomState.choice()
ValueError: 'a' must be greater than 0 unless no samples are taken
import pandas as pd
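The `ValueError` above is what NumPy raises when asked to draw from an empty population, which suggests (an assumption, since the data is not shown) that no row had `polarity` exactly equal to `0.3`. The sketch below shows the same idea in plain Python rather than pandas: match floats with a tolerance and return an empty result instead of sampling from nothing. `sample_matching` and its parameters are hypothetical names.

```python
import math
import random


def sample_matching(rows, key, target, k=5, tol=1e-9):
    """Pick up to k random rows whose rows[key] is (almost) equal to target."""
    matches = [row for row in rows
               if math.isclose(row[key], target, abs_tol=tol)]
    if not matches:
        return []  # nothing to sample from; avoid raising on an empty set
    return random.sample(matches, min(k, len(matches)))
```

With a DataFrame, the equivalent check would be to test whether the filtered frame is empty before calling `sample`.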
As title suggests, how do I test on new values rather than simply on validation dataloader? Please and thank you.
Hello,
That's really wonderful work. However, when I saved the model and tried to predict with it, it didn't work.
What is the best way to save it and use it for later predictions?
I am getting an error after this code:
def simple_model(input_shape, output_sequence_length, english_vocab_size, french_vocab_size):
    """
    Build and train a basic RNN on x and y
    :param input_shape: Tuple of input shape
    :param output_sequence_length: Length of output sequence
    :param english_vocab_size: Number of unique English words in the dataset
    :param french_vocab_size: Number of unique French words in the dataset
    :return: Keras model built, but not trained
    """
    # TODO: Build the layers
    learning_rate = 1e-3
    input_seq = Input(input_shape[1:])
    rnn = GRU(64, return_sequences = True)(input_seq)
    logits = TimeDistributed(Dense(french_vocab_size))(rnn)
    model = Model(input_seq, Activation('softmax')(logits))
    model.compile(loss = sparse_categorical_crossentropy,
                  optimizer = Adam(learning_rate),
                  metrics = ['accuracy'])
    return model

tests.test_simple_model(simple_model)
tmp_x = pad(preproc_english_sentences, max_french_sequence_length)
tmp_x = tmp_x.reshape((-1, preproc_french_sentences.shape[-2], 1))
simple_rnn_model = simple_model(
    tmp_x.shape,
    max_french_sequence_length,
    english_vocab_size,
    french_vocab_size)
simple_rnn_model.fit(tmp_x, preproc_french_sentences, batch_size=1024, epochs=10, validation_split=0.2)
print(logits_to_text(simple_rnn_model.predict(tmp_x[:1])[0], french_tokenizer))
I am getting an assertion error about not using the sparse cross-entropy function. I am using TensorFlow 2.1.0 and Python 3.6.
doc_sample = documents[documents['index'] == 4310].values[0][0]
print('original document: ')
words = []
for word in doc_sample.split(' '):
    words.append(word)
print(words)
print('\n\n tokenized and lemmatized document: ')
print(preprocess(doc_sample))
When I run these lines on my documents I get an IndexError. Please help.
IndexError: index 0 is out of bounds for axis 0 with size 0
Hello Ma'am,
I tried to replicate your work for my own project. (My native language is Gujarati, so I want to make a small machine translation application.) Your article on towardsdatascience.com and this GitHub file have been very helpful so far, but I am facing an issue when fitting the model.
Will you help me solve the issue?
Thank you.
The error is listed below.
Hello, I want to save the final concatenated model, so that I could predict new sentences using that.
Many thanks for making these notebooks available!
Do you have the CSV files data/toxic_train.csv and data/toxic_test.csv from https://towardsdatascience.com/classify-toxic-online-comments-with-lstm-and-glove-e455a58da9c7 (and: https://github.com/susanli2016/NLP-with-Python/blob/master/Toxic%20Comments%20LSTM%20GloVe.ipynb) available that I can use?
I'd like to run the notebook but I don't know where to find the source data.
Thanks again 🦄
I tried the script on lots of datasets and get this error very frequently. Please help.
You classified the documents based on the labels but did not use the 'Title'. Is it possible to classify text documents without even using the text?
I'm attempting to follow the steps and running into issues with this line
" simple_rnn_model = simple_model(tmp_x.shape, max_french_sequence_length, english_vocab_size, french_vocab_size) "
This seems to be caused by a difference between our TensorFlow/Keras versions. Could you create a requirements.txt that includes the version numbers of these packages?
The repo looks really helpful but the README needs a lot of work
I made a few corrections to get the code working. I commented out the things I don't need, but left them in for other users. This code collects User, Date, Title and Review. THERE ARE ISSUES WITH THE USER: sometimes the city/country is appended to it.
I needed the code to export to JSON, so I created a list of lists:
[0, {user, date, review}], [1, {user, date, review}], [2, {user, date, review}], ...
I also corrected some cosmetic issues with the URL, as it requested or{} and or{0}, and added a way out of the infinite loop issue reported on another thread.
The initial code was VERY HELPFUL, but it didn't do what I needed and has some bugs. I hope this helps others like me who want to extract reports to JSON.
# -*- coding: utf-8 -*-
import requests
from bs4 import BeautifulSoup
import json
import webbrowser
import io

def display(content, filename='output.html'):
    with open(filename, 'wb') as f:
        f.write(content)
    webbrowser.open(filename)

def get_soup(session, url, show=False):
    r = session.get(url)
    if show:
        display(r.content, 'temp.html')
    if r.status_code != 200: # not OK
        print('[get_soup] status code:', r.status_code)
    else:
        return BeautifulSoup(r.text, 'html.parser')

def post_soup(session, url, params, show=False):
    '''Read HTML from server and convert to Soup'''
    r = session.post(url, data=params)
    if show:
        display(r.content, 'temp.html')
    if r.status_code != 200: # not OK
        print('[post_soup] status code:', r.status_code)
    else:
        return BeautifulSoup(r.text, 'html.parser')

def scrape(url, lang='ALL'):
    # create session to keep all cookies (etc.) between requests
    session = requests.Session()
    session.headers.update({
        'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:57.0) Gecko/20100101 Firefox/57.0',
    })
    items = parse(session, url + '?filterLang=' + lang)
    return items

def parse(session, url):
    '''Get number of reviews and start getting subpages with reviews'''
    print('[parse] url:', url)
    soup = get_soup(session, url)
    if not soup:
        print('[parse] no soup:', url)
        return
    num_reviews = soup.find('span', class_='reviews_header_count').text # get text
    num_reviews = num_reviews[1:-1]
    num_reviews = num_reviews.replace(',', '')
    num_reviews = int(num_reviews) # convert text into integer
    print('[parse] num_reviews ALL:', num_reviews)
    url_template = url.replace('.html', '-or{}.html')
    #print('[parse] url_template:', url_template)
    items = []
    offset = 0
    while(True):
        subpage_url = url_template.format(offset)
        # the first page is requested with the plain URL, without the -or{} suffix
        if (not offset) or (offset == 0):
            subpage_items = parse_reviews(session, url)
        else:
            subpage_items = parse_reviews(session, subpage_url)
        if not subpage_items:
            break
        items += subpage_items
        if len(subpage_items) < 10:
            break
        offset += 10
        # way out of the infinite loop: stop once the offset passes the review count
        if offset > num_reviews:
            break
    return items

def get_reviews_ids(soup):
    items = soup.find_all('div', attrs={'data-reviewid': True})
    if items:
        reviews_ids = [x.attrs['data-reviewid'] for x in items][::2]
        print('[get_reviews_ids] data-reviewid:', reviews_ids)
        return reviews_ids

def parse_reviews(session, url):
    '''Get all reviews from one page'''
    print('[parse_reviews] url:', url)
    soup = get_soup(session, url)
    if not soup:
        print('[parse_reviews] no soup:', url)
        return
    hotel_name = soup.find('h1').text
    reviews_ids = get_reviews_ids(soup)
    if not reviews_ids:
        return
    #soup = get_more(session, reviews_ids)
    if not soup:
        print('[parse_reviews] no soup:', url)
        return
    items = []
    for idx, review in enumerate(soup.find_all('div', class_='reviewSelector')):
        #
        # badgets = review.find_all('span', class_='badgetext')
        # if len(badgets) > 0:
        #     contributions = badgets[0].text
        # else:
        #     contributions = '0'
        #
        # if len(badgets) > 1:
        #     helpful_vote = badgets[1].text
        # else:
        #     helpful_vote = '0'
        # user_loc = review.select_one('div.userLoc strong')
        # if user_loc:
        #     user_loc = user_loc.text
        # else:
        #     user_loc = ''
        #
        # bubble_rating = review.select_one('span.ui_bubble_rating')['class']
        # bubble_rating = bubble_rating[1].split('_')[-1]
        item = {
            'review_user' : review.find('div', class_='info_text').text,
            'review_date' : review.find('span', class_='ratingDate').text, # 'ratingDate' instead of 'relativeDate'
            'review_title': review.find('span', class_='noQuotes').text,
            #'review_content': review.find('div', class_='entry').text,
        }
        items.append(item)
        print('\n--- Review --- \n')
        for key,val in item.items():
            print(key+' :', val)
        print()
    return items

def write_in_json(items, filename):
    review = 0
    reviews = []
    total_reviews = len(items)
    #Create a List of Lists
    while (review < total_reviews) :
        reviews.append((review,items[review]))
        print(reviews[review])
        review += 1
    print('--- Writing File ---')
    with open(filename, 'w') as jsonfile:
        json.dump(reviews, jsonfile, ensure_ascii=False, indent=2)
    print('--- Done ---')

start_urls = [
    'https://www.tripadvisor.es/Restaurant_Review-g187452-d12206985-Reviews-Baiba_Cafe-Oviedo_Asturias.html',
]
lang = 'es'
for url in start_urls:
    # get all reviews for 'url' and 'lang'
    items = scrape(url, lang)
    if not items:
        print('No reviews')
    else:
        #Write in JSON values
        filename = url.split('Reviews-')[1][:-5] + '_' + lang
        write_in_json(items, filename + '.json')
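The list-of-lists layout described above can also be built with `enumerate`. The following is only a sketch of that idea, not the poster's code: `index_reviews` and `dump_reviews` are hypothetical names, and the file is opened with an explicit UTF-8 encoding (an addition here) so that non-ASCII review text written with `ensure_ascii=False` does not depend on the platform's default encoding.

```python
import json


def index_reviews(items):
    """Pair each review dict with its position: [[0, {...}], [1, {...}], ...]."""
    return [[index, item] for index, item in enumerate(items)]


def dump_reviews(items, filename):
    """Write the indexed reviews to `filename` as UTF-8 encoded JSON."""
    with open(filename, 'w', encoding='utf-8') as jsonfile:
        json.dump(index_reviews(items), jsonfile, ensure_ascii=False, indent=2)
```

Calling `dump_reviews(items, filename + '.json')` would take the place of the `write_in_json` call in the script above.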
When I use English and Chinese (changing French to Chinese), the only output I get is ' '.
Hey, susanli, I'm learning machine translation and I find your code very elegant and concise, so I want to run it just for learning. But in the "machine_translation.ipynb" file you import the helper and tests packages, and I cannot find that code in your project. Would you help me with this? Thank you in advance.
Kindly tell me where the consumer_complaints_small.csv file is.