
nlp-with-python's Introduction

NLP with Python

Scikit-Learn, NLTK, Spacy, Gensim, Textblob and more

nlp-with-python's People

Contributors

susanli2016


nlp-with-python's Issues

About doc2bow

bow_corpus = [dictionary.doc2bow(doc) for doc in processed_doc]
This line sometimes works and sometimes doesn't: sometimes I get a result, but sometimes bow_corpus comes back empty.
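A hedged note on a common cause: doc2bow silently drops any token that is not in the dictionary, so if Dictionary.filter_extremes() (or a very small corpus) removed all of a document's tokens, that document's bag-of-words comes back empty. A minimal sketch, assuming processed_doc is a list of token lists as in the notebook:

```python
# Minimal sketch: doc2bow ignores tokens missing from the dictionary,
# so an aggressive filter_extremes() can leave some documents empty.
from gensim.corpora import Dictionary

processed_doc = [['hello', 'world'], ['rare', 'tokens', 'only']]
dictionary = Dictionary(processed_doc)
# If this filter removes every token of a document, its bow will be [].
dictionary.filter_extremes(no_below=2, no_above=0.5, keep_n=100000)

bow_corpus = [dictionary.doc2bow(doc) for doc in processed_doc]
empty = [i for i, bow in enumerate(bow_corpus) if not bow]
print('documents with empty bow:', empty)
```

Relaxing the filter thresholds, or dropping the empty documents before training, usually resolves it.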

About code

for idx, topic in lda_model_tfidf.print_topics(-1):
print('Topic: {} Word: {}'.format(idx, topic))
What's the significance of -1 in the first line?
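A hedged note for this recurring question: in gensim, print_topics(num_topics, num_words) caps how many topics are returned, and passing -1 requests all topics instead of the default of 20. An equivalent, more explicit call (assuming a trained gensim LdaModel):

```python
# num_topics=-1 means "return every topic", not just the first 20.
for idx, topic in lda_model_tfidf.print_topics(num_topics=-1, num_words=10):
    print('Topic: {} Word: {}'.format(idx, topic))
```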

In Bayesian

I have almost replicated your model, and I am getting an error while computing the precision, recall, and F1 score values. I have done all kinds of debugging and still could not find the cause.
(screenshot attached)

README

The repo looks really helpful but the README needs a lot of work

Getting an Error with shape

Hello Ma'am,
I tried to replicate your work for my own project (my native language is Gujarati, so I want to build a small machine translation application). Your article on towardsdatascience.com and this GitHub notebook have been very helpful so far, but I am facing an issue when fitting the model.
Will you help me solve the issue?
Thank you.
The error is shown in the attached screenshots (199, 200, 201).

Doubt regarding data passed to a saved model

Dear sir,
This is a simple question, but I am not sure about the result. I am working on topic modeling with the LDA algorithm, which is an unsupervised learning algorithm. I have 120 documents in total, which I divided into 100 and 20. First, I passed the 100 documents to the model and stored the result; then I saved the model and the dictionary.
This saved model is then used to predict the topics of the 20 documents, and I got the result.

As we know, a saved model is used to predict unseen data.
My question is: can I pass the first 100 documents (the ones used while saving the model) together with the 20 unseen documents to the saved model? Is that possible? Will it affect the model's performance?

Predicting from model

Hello ,
That's really wonderful work. However, when I saved the model and tried to predict with it, it didn't work.
What is the best way to save it and use it for later predictions?

How to add POS_TAG feature?

Your code is very clear, thanks for that!

But how can I add more features to this model?
You only used words; what if I want to add POS_TAGS as well?
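A hedged sketch, not from the repo: one simple way to add POS information to a bag-of-words model is to tag each token with NLTK and feed "word_TAG" composites to the vectorizer alongside the plain words. The tokens_with_pos helper below is hypothetical:

```python
import nltk
from sklearn.feature_extraction.text import CountVectorizer

nltk.download('punkt', quiet=True)
nltk.download('averaged_perceptron_tagger', quiet=True)

def tokens_with_pos(text):
    # Tag each token, then keep both the word itself and a word_TAG
    # composite as separate features for the vectorizer.
    tagged = nltk.pos_tag(nltk.word_tokenize(text))
    return [w for w, t in tagged] + ['{}_{}'.format(w, t) for w, t in tagged]

# A callable analyzer bypasses CountVectorizer's own tokenization.
vectorizer = CountVectorizer(analyzer=tokens_with_pos)
X = vectorizer.fit_transform(['The quick brown fox jumps over the lazy dog.'])
print(sorted(vectorizer.vocabulary_)[:10])
```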

Error in line 67

I am getting an error on line 67, List index out of range. Please help me with it.

Looping

Hello everyone,
After solving some issues with the code, I successfully managed to extract a few CSV files of reviews.
Sometimes, however, the extraction starts looping over the 5 latest reviews and never stops, so the CSV file is never produced.
Do you have any idea how to make it stop as soon as the review IDs are duplicated? Or even just produce a .csv file without having to wait for the scrape to finish? I'm attaching my version of Susanli's code; a hedged sketch of a duplicate-ID guard follows it.


import requests
from bs4 import BeautifulSoup
import csv
import webbrowser
import io

def display(content, filename='output.html'):
    with open(filename, 'wb') as f:
        f.write(content)
        webbrowser.open(filename)


def get_soup(session, url, show=False):
    r = session.get(url)
    if show:
        display(r.content, 'temp.html')
    if r.status_code != 200: # not OK
        print('[get_soup] status code:', r.status_code)
    else:
        return BeautifulSoup(r.text, 'html.parser')

	
def post_soup(session, url, params, show=False):
    '''Read HTML from server and convert to Soup'''

    r = session.post(url, data=params)
    
    if show:
        display(r.content, 'temp.html')

    if r.status_code != 200: # not OK
        print('[post_soup] status code:', r.status_code)
    else:
        return BeautifulSoup(r.text, 'html.parser')

def scrape(url, lang='ALL'):

    # create session to keep all cookies (etc.) between requests
    session = requests.Session()

    session.headers.update({
        'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:57.0) Gecko/20100101 Firefox/57.0',
    })


    items = parse(session, url + '?filterLang=' + lang)

    return items

def parse(session, url):
    '''Get number of reviews and start getting subpages with reviews'''

    print('[parse] url:', url)

    soup = get_soup(session, url)

    if not soup:
        print('[parse] no soup:', url)
        return

    num_reviews = soup.find('span', class_='reviews_header_count').text # get text    
    num_reviews = num_reviews[1:-1] 
    num_reviews = num_reviews.replace(',', '')
    num_reviews = int(num_reviews) # convert text into integer
    print('[parse] num_reviews ALL:', num_reviews)

    url_template = url.replace('.html', '-or{}.html')
    print('[parse] url_template:', url_template)

    items = []

    offset = 0

    while True:
        subpage_url = url_template.format(offset)

        subpage_items = parse_reviews(session, subpage_url)
        if not subpage_items:
            break

        items += subpage_items

        if len(subpage_items) < 5:
            break

        offset += 5

    return items

def get_reviews_ids(soup):

    items = soup.find_all('div', attrs={'data-reviewid': True})

    if items:
        reviews_ids = [x.attrs['data-reviewid'] for x in items][::2]
        print('[get_reviews_ids] data-reviewid:', reviews_ids)
        return reviews_ids

def get_more(session, reviews_ids):

    url = 'https://www.tripadvisor.com/OverlayWidgetAjax?Mode=EXPANDED_HOTEL_REVIEWS_RESP&metaReferer=Hotel_Review'

    payload = {
        'reviews': ','.join(reviews_ids), # ie. "577882734,577547902,577300887",
        #'contextChoice': 'DETAIL_HR', # ???
        'widgetChoice': 'EXPANDED_HOTEL_REVIEW_HSX', # ???
        'haveJses': 'earlyRequireDefine,amdearly,global_error,long_lived_global,apg-Hotel_Review,apg-Hotel_Review-in,bootstrap,desktop-rooms-guests-dust-en_US,responsive-calendar-templates-dust-en_US,taevents',
        'haveCsses': 'apg-Hotel_Review-in',
        'Action': 'install',
    }

    soup = post_soup(session, url, payload)

    return soup

def parse_reviews(session, url):
    '''Get all reviews from one page'''

    print('[parse_reviews] url:', url)

    soup =  get_soup(session, url)

    if not soup:
        print('[parse_reviews] no soup:', url)
        return

    hotel_name = soup.find('h1').text

    reviews_ids = get_reviews_ids(soup)
    if not reviews_ids:
        return

    soup = get_more(session, reviews_ids)

    if not soup:
        print('[parse_reviews] no soup:', url)
        return

    items = []

    for idx, review in enumerate(soup.find_all('div', class_='reviewSelector')):

        badgets = review.find_all('span', class_='badgetext')
        if len(badgets) > 0:
            contributions = badgets[0].text
        else:
            contributions = '0'

        if len(badgets) > 1:
            helpful_vote = badgets[1].text
        else:
            helpful_vote = '0'
        user_loc = review.select_one('div.userLoc strong')
        if user_loc:
            user_loc = user_loc.text
        else:
            user_loc = ''
            
        bubble_rating = review.select_one('span.ui_bubble_rating')['class']
        bubble_rating = bubble_rating[1].split('_')[-1]

        item = {
            'review_body': review.find('p', class_='partial_entry').text,
            'review_date': review.find('span', class_='ratingDate')['title'], # 'ratingDate' instead of 'relativeDate'
        }

        items.append(item)
        print('\n--- review ---\n')
        for key,val in item.items():
            print(' ', key, ':', val)

    print()

    return items

def write_in_csv(items, filename='results.csv',
                  headers=['hotel name', 'review title', 'review body',
                           'review date', 'contributions', 'helpful vote',
                           'user name' , 'user location', 'rating'],
                  mode='w'):

    print('--- CSV ---')

    with io.open(filename, mode, encoding="utf-8") as csvfile:
        csv_file = csv.DictWriter(csvfile, headers)

        if mode == 'w':
            csv_file.writeheader()

        csv_file.writerows(items)

DB_COLUMN   = 'review_body'
DB_COLUMN1 = 'review_date'

start_urls = [
    'https://www.tripadvisor.com/Restaurant_Review-g187813-d12072302-Reviews-Eataly_Trieste-Trieste_Province_of_Trieste_Friuli_Venezia_Giulia.html',
]

headers = [ 
    DB_COLUMN, 
    DB_COLUMN1, 
]

lang = 'it'

for url in start_urls:

    # get all reviews for 'url' and 'lang'
    items = scrape(url, lang)

    if not items:
        print('No reviews')
    else:
        # write in CSV
        filename = url.split('Reviews-')[1][:-5] + '__' + lang
        print('filename:', filename)
        write_in_csv(items, filename + '.csv', headers, mode='w')
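A hedged sketch of the duplicate-ID guard asked for above. It reuses get_soup, get_reviews_ids, and parse_reviews from the code as posted; the seen_ids set and the parse_with_dedup wrapper are additions, not part of the original, and each subpage is fetched twice for simplicity:

```python
def parse_with_dedup(session, url):
    # Drop-in alternative to parse(): stop paging as soon as a subpage
    # yields only review IDs we have already collected.
    url_template = url.replace('.html', '-or{}.html')
    seen_ids = set()
    items = []
    offset = 0

    while True:
        subpage_url = url_template.format(offset)
        page_soup = get_soup(session, subpage_url)
        if not page_soup:
            break

        reviews_ids = get_reviews_ids(page_soup) or []
        new_ids = [rid for rid in reviews_ids if rid not in seen_ids]
        if not new_ids:
            # The site has started serving the same reviews again:
            # stop instead of looping forever.
            break
        seen_ids.update(new_ids)

        subpage_items = parse_reviews(session, subpage_url) or []
        items += subpage_items
        if len(subpage_items) < 5:
            break
        offset += 5

    return items
```

Calling write_in_csv(items, ...) after each page inside the loop (with mode='a' after the first write) would also get partial results onto disk without waiting for the scrape to finish.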

Consulting on missing packages

Hey susanli, I'm learning machine translation and I find your code very elegant and concise, so I wanted to run it just for learning. But in the "machine_translation.ipynb" file you import the helper and tests packages, and I cannot find their code anywhere in your project. Would you help me with this? Thank you in advance.

string object has no attribute desc

I tried the github code as explained in this article...
https://towardsdatascience.com/automatically-generate-hotel-descriptions-with-lstm-afa37002d4fc?source=rss----7f60cf5620c9---4

But got an error in the last cell:
print(generate_text("hilton seattle downtown", 100, model, max_sequence_len))
AttributeError: 'str' object has no attribute 'desc'

Every cell in this notebook works as expected except the last one:
https://github.com/susanli2016/NLP-with-Python/blob/master/Hotel%20Description%20Generation%20LSTM.ipynb

Newbie

Hello Susanli,
I tried to use your code but I received this error:

AttributeError                            Traceback (most recent call last)
<ipython-input-2-fede94e53ef8> in <module>
    211 
    212     # get all reviews for 'url' and 'lang'
--> 213     items = scrape(url, lang)
    214 
    215     if not items:

<ipython-input-2-fede94e53ef8> in scrape(url, lang)
     48 
     49 
---> 50     items = parse(session, url + '?filterLang=' + lang)
     51 
     52     return items

<ipython-input-2-fede94e53ef8> in parse(session, url)
     63         return
     64 
---> 65     num_reviews = soup.find('span', class_='reviews_header_count').text # get text
     66     num_reviews = num_reviews[1:-1]
     67     num_reviews = num_reviews.replace(',', '')

AttributeError: 'NoneType' object has no attribute 'text'

I'm a university business professor but a newbie with Python. Could you help me use your solution to scrape TripAdvisor hotel reviews? Thanks in advance.

requirements.txt

I'm attempting to follow the steps and running into issues with this line

" simple_rnn_model = simple_model(tmp_x.shape, max_french_sequence_length, english_vocab_size, french_vocab_size) "

This seems to be a version difference between our TensorFlow/Keras installations. Could you potentially create a requirements.txt that includes the version numbers of these packages?

ValueError: 'a' must be greater than 0 unless no samples are taken

I was trying to use this same concept for my dataset and got the error in the title. Could anybody help? For reference, the shape of the dataframe is (56, 11).

The error message is below:


ValueError                                Traceback (most recent call last)
in
      1 print('5 random songs with the highest positive sentiment polarity: \n')
----> 2 hps = data.loc[data['polarity'] == 0.3, ['titles']].sample().values
      3 for h in hps:
      4     print(h[0])

~/anaconda3/lib/python3.7/site-packages/pandas/core/generic.py in sample(self, n, frac, replace, weights, random_state, axis)
   5059                 )
   5060
-> 5061         locs = rs.choice(axis_length, size=n, replace=replace, p=weights)
   5062         return self.take(locs, axis=axis)
   5063

mtrand.pyx in mtrand.RandomState.choice()

ValueError: 'a' must be greater than 0 unless no samples are taken
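A hedged diagnosis (the guard below is an addition, not from the original notebook): pandas' .sample() delegates to numpy's RandomState.choice, which raises exactly this ValueError when asked to sample from zero rows, so the filter data['polarity'] == 0.3 most likely matched nothing in the (56, 11) frame. Guarding the selection shows whether that is the case:

```python
# Check for an empty selection before sampling from it.
subset = data.loc[data['polarity'] == 0.3, ['titles']]
if subset.empty:
    print('no rows with polarity == 0.3; '
          'try a range instead, e.g. data["polarity"] >= 0.3')
else:
    hps = subset.sample().values
    for h in hps:
        print(h[0])
```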

Text_classification_with_BERT.ipynb

You classified the documents based on the labels but did not use the 'Title'. Is it possible to classify text documents without even using the text?

machine translation code issue

I am getting an error after this code:

def simple_model(input_shape, output_sequence_length, english_vocab_size, french_vocab_size):
    """
    Build and train a basic RNN on x and y
    :param input_shape: Tuple of input shape
    :param output_sequence_length: Length of output sequence
    :param english_vocab_size: Number of unique English words in the dataset
    :param french_vocab_size: Number of unique French words in the dataset
    :return: Keras model built, but not trained
    """
    # TODO: Build the layers
    learning_rate = 1e-3
    input_seq = Input(input_shape[1:])
    rnn = GRU(64, return_sequences=True)(input_seq)
    logits = TimeDistributed(Dense(french_vocab_size))(rnn)
    model = Model(input_seq, Activation('softmax')(logits))
    model.compile(loss=sparse_categorical_crossentropy,
                  optimizer=Adam(learning_rate),
                  metrics=['accuracy'])

    return model

tests.test_simple_model(simple_model)

# Reshape the input to work with a basic RNN
tmp_x = pad(preproc_english_sentences, max_french_sequence_length)
tmp_x = tmp_x.reshape((-1, preproc_french_sentences.shape[-2], 1))

# Train the neural network
simple_rnn_model = simple_model(
    tmp_x.shape,
    max_french_sequence_length,
    english_vocab_size,
    french_vocab_size)
simple_rnn_model.fit(tmp_x, preproc_french_sentences, batch_size=1024, epochs=10, validation_split=0.2)

# Print prediction(s)
print(logits_to_text(simple_rnn_model.predict(tmp_x[:1])[0], french_tokenizer))

error

I am getting an assertion error about not using the sparse cross-entropy function, using TensorFlow 2.1.0 and Python 3.6.
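A hedged guess, not confirmed against this repo's tests: test_simple_model typically asserts that the compiled loss is the Keras sparse_categorical_crossentropy function object, so under TF 2.x it can fail if the loss is passed as a string or imported from standalone keras instead of tensorflow.keras. Importing everything from tensorflow.keras may resolve it:

```python
# All of these imports exist in TF 2.1; how tests.test_simple_model
# compares the loss is an assumption.
from tensorflow.keras.layers import Input, GRU, TimeDistributed, Dense, Activation
from tensorflow.keras.models import Model
from tensorflow.keras.losses import sparse_categorical_crossentropy
from tensorflow.keras.optimizers import Adam
```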

IndexError: index 0 is out of bounds for axis 0 with size 0

doc_sample = documents[documents['index'] == 4310].values[0][0]

print('original document: ')
words = []
for word in doc_sample.split(' '):
    words.append(word)
print(words)
print('\n\n tokenized and lemmatized document: ')
print(preprocess(doc_sample))

When I run these lines on my documents, it raises an IndexError. Please help:
IndexError: index 0 is out of bounds for axis 0 with size 0
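A hedged reading of this error: the filter documents[documents['index'] == 4310] matched no rows, so .values is an empty array and [0] fails. The guard below is an addition that makes the failure explicit, assuming documents has an 'index' column as in the notebook:

```python
# Check the selection before indexing into it.
sample_rows = documents[documents['index'] == 4310]
if sample_rows.empty:
    print('no row with index == 4310; available range:',
          documents['index'].min(), 'to', documents['index'].max())
else:
    doc_sample = sample_rows.values[0][0]
    print(preprocess(doc_sample))
```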

Error while training (InvalidArgumentError)

On running the code as it is on the provided data I get the following error :

python: 3.7
tf: 1.15
keras: 2.3.1

InvalidArgumentError: Incompatible shapes: [21504] vs. [1024,21]
[[{{node metrics_11/acc/Equal}}]]
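A hedged reading of the shapes, since 21504 = 1024 × 21: the labels appear to be flattened to one axis while the model output keeps (batch=1024, timesteps=21). With sparse_categorical_crossentropy the targets usually need a trailing axis of size 1, so expanding the label array before fit() is a common fix (an assumption about this notebook, not a confirmed patch):

```python
import numpy as np

# Give the labels shape (num_samples, timesteps, 1) to match the
# per-timestep logits produced by the TimeDistributed layer.
preproc_french_sentences = np.expand_dims(preproc_french_sentences, -1)
```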

Windows gives error when installing from source

I'm following the directions Spacy gives to install on Windows, with Python 3, from source (pip and conda have both given me errors that I've been unable to resolve; installing directly from source seems to get the closest to actually working). However, when I get to step 3 and enter export PYTHONPATH = pwd on the command line (with the quotes around pwd like it wants; they just mess up the formatting here), I get this error message:

'export' is not recognized as an internal or external command, operable program or batch file.
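A hedged note: `export` is a Unix shell builtin, so it does not exist in the Windows command prompt. The cmd equivalent is `set PYTHONPATH=%cd%`, and in PowerShell `$env:PYTHONPATH = (Get-Location).Path`; alternatively, running the step from Git Bash or WSL keeps the `export` syntax as written in the instructions.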

About the dataset

Hello @susanli2016. I'm really interested in the dataset you've used, specifically the corona_fake.csv dataset. I would like to know from where you got it or how you were able to scrape it. Thanks!

Corrections done and exporting to JSON

I made a few corrections to get the code working. I commented out the things I don't need but left them in for other users. This code collects User, Date, Title and Review. THERE ARE ISSUES WITH THE USER field: sometimes it appends the city/country.

I needed the code to export to JSON, so I created a list of lists:
[0, {user, date, review}], [1, {user, date, review}], [2, {user, date, review}], ...

I also corrected some cosmetic issues with the URL, which requested or{} and or{0}, and added a way out of the infinite-loop issue reported in another thread.

The initial code was VERY HELPFUL, but it didn't do what I needed and had some bugs. I hope this helps others like me who want to export reviews to JSON.

# -*- coding: utf-8 -*-
import requests
from bs4 import BeautifulSoup
import json
import webbrowser
import io

def display(content, filename='output.html'):
    with open(filename, 'wb') as f:
        f.write(content)
        webbrowser.open(filename)

def get_soup(session, url, show=False):
    r = session.get(url)
    if show:
        display(r.content, 'temp.html')
    if r.status_code != 200: # not OK
        print('[get_soup] status code:', r.status_code)
    else:
        return BeautifulSoup(r.text, 'html.parser')

def post_soup(session, url, params, show=False):
    '''Read HTML from server and convert to Soup'''
    r = session.post(url, data=params)
    if show:
        display(r.content, 'temp.html')
    if r.status_code != 200: # not OK
        print('[post_soup] status code:', r.status_code)
    else:
        return BeautifulSoup(r.text, 'html.parser')

def scrape(url, lang='ALL'):
    # create session to keep all cookies (etc.) between requests
    session = requests.Session()
    session.headers.update({
        'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:57.0) Gecko/20100101 Firefox/57.0',
    })
    items = parse(session, url + '?filterLang=' + lang)
    return items

def parse(session, url):
    '''Get number of reviews and start getting subpages with reviews'''
    print('[parse] url:', url)
    soup = get_soup(session, url)
    if not soup:
        print('[parse] no soup:', url)
        return
    num_reviews = soup.find('span', class_='reviews_header_count').text # get text
    num_reviews = num_reviews[1:-1]
    num_reviews = num_reviews.replace(',', '')
    num_reviews = int(num_reviews) # convert text into integer
    print('[parse] num_reviews ALL:', num_reviews)
    url_template = url.replace('.html', '-or{}.html')
    #print('[parse] url_template:', url_template)
    items = []
    offset = 0
    while True:
        subpage_url = url_template.format(offset)
        if (not offset) or (offset == 0):
            subpage_items = parse_reviews(session, url)
        else:
            subpage_items = parse_reviews(session, subpage_url)
        if not subpage_items:
            break
        items += subpage_items
        if len(subpage_items) < 10:
            break
        offset += 10
        if offset > num_reviews:
            break
    return items

def get_reviews_ids(soup):
    items = soup.find_all('div', attrs={'data-reviewid': True})
    if items:
        reviews_ids = [x.attrs['data-reviewid'] for x in items][::2]
        print('[get_reviews_ids] data-reviewid:', reviews_ids)
        return reviews_ids

def parse_reviews(session, url):
    '''Get all reviews from one page'''
    print('[parse_reviews] url:', url)
    soup = get_soup(session, url)
    if not soup:
        print('[parse_reviews] no soup:', url)
        return
    hotel_name = soup.find('h1').text
    reviews_ids = get_reviews_ids(soup)
    if not reviews_ids:
        return
    #soup = get_more(session, reviews_ids)
    #if not soup:
    #    print('[parse_reviews] no soup:', url)
    #    return
    items = []
    for idx, review in enumerate(soup.find_all('div', class_='reviewSelector')):
        # badgets = review.find_all('span', class_='badgetext')
        # if len(badgets) > 0:
        #     contributions = badgets[0].text
        # else:
        #     contributions = '0'
        # if len(badgets) > 1:
        #     helpful_vote = badgets[1].text
        # else:
        #     helpful_vote = '0'
        # user_loc = review.select_one('div.userLoc strong')
        # if user_loc:
        #     user_loc = user_loc.text
        # else:
        #     user_loc = ''
        # bubble_rating = review.select_one('span.ui_bubble_rating')['class']
        # bubble_rating = bubble_rating[1].split('_')[-1]
        item = {
            'review_user': review.find('div', class_='info_text').text,
            'review_date': review.find('span', class_='ratingDate').text, # 'ratingDate' instead of 'relativeDate'
            'review_title': review.find('span', class_='noQuotes').text,
            #'review_content': review.find('div', class_='entry').text,
        }
        items.append(item)
        print('\n--- Review --- \n')
        for key, val in item.items():
            print(key + ' :', val)
    print()
    return items

def write_in_json(items, filename):
    review = 0
    reviews = []
    total_reviews = len(items)
    # Create a list of lists
    while review < total_reviews:
        reviews.append((review, items[review]))
        print(reviews[review])
        review += 1
    print('--- Writing File ---')
    with open(filename, 'w') as jsonfile:
        json.dump(reviews, jsonfile, ensure_ascii=False, indent=2)
    print('--- Done ---')

start_urls = [
    'https://www.tripadvisor.es/Restaurant_Review-g187452-d12206985-Reviews-Baiba_Cafe-Oviedo_Asturias.html',
]

lang = 'es'

for url in start_urls:
    # get all reviews for 'url' and 'lang'
    items = scrape(url, lang)
    if not items:
        print('No reviews')
    else:
        # write in JSON
        filename = url.split('Reviews-')[1][:-5] + '_' + lang
        write_in_json(items, filename + '.json')

Topic Modeling for Data Preprocessing notebook produces incorrect results.

I am running through the notebook as written on GitHub, and the recommendation engine at the end gives me the same results as the beginning of the notebook (different from the results your notebook previews). You clearly had it set up correctly at some point before saving it. I have been trying to figure out what is going wrong; the data that your preprocessing drops does seem to be dropped from the set that is fed back into the recommendation engine. Tell me what additional information would help and I'll do what I can to gather it.

