makcedward / nlp
:memo: This repository recorded my NLP journey.
Home Page: https://makcedward.github.io/
Hi,
Just found your blog, and there is a lot of useful information there.
I noticed that you put ULMFiT under OpenAI in your README table, while as far as I know, it is from fast.ai.
Thanks for compiling and sharing your knowledge.
Line 105 in 2f12277
In the above line, you assign the result of the LSA to the variables x_train_lda and x_test_lda. Shouldn't it be lsa in both cases?
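For what it's worth, a minimal numpy sketch of the naming point, using a toy term-document matrix as a stand-in for the repository's real TF-IDF features (the matrices below are hypothetical):

```python
import numpy as np

# Toy term-document matrices (stand-ins for the real TF-IDF features).
x_train = np.array([[1., 0., 2.], [0., 1., 1.], [2., 1., 0.]])
x_test = np.array([[1., 1., 1.]])

# LSA is a truncated SVD of the training matrix.
n_components = 2
u, s, vt = np.linalg.svd(x_train, full_matrices=False)
components = vt[:n_components]  # top singular directions

# Name the results after the method that produced them: lsa, not lda.
x_train_lsa = x_train @ components.T
x_test_lsa = x_test @ components.T
print(x_train_lsa.shape, x_test_lsa.shape)
```

Keeping the suffix consistent with the method avoids exactly the confusion raised in this issue.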
Hi, I am getting an error while generating InferSent embeddings. The error is as follows, with the full traceback at the end of this message:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x86 in position 11: invalid start byte
The error occurs after I run infer_sent_embs.build_vocab(x_train, tokenize=True).
Note that I ran your code in Google Colab. Also note that the InferSent model links in the Python file infersent.py need to be updated (the old links have expired).
The new links are
INFERSENT_GLOVE_MODEL_URL = 'https://dl.fbaipublicfiles.com/infersent/infersent1.pkl'
INFERSENT_FASTTEXT_MODEL_URL = 'https://dl.fbaipublicfiles.com/infersent/infersent2.pkl'
UnicodeDecodeError Traceback (most recent call last)
in ()
----> 1 infer_sent_embs.build_vocab(x_train, tokenize=True)
2 x_train_t = infer_sent_embs.encode(x_train, tokenize=True)
3 x_test_t = infer_sent_embs.encode(x_test, tokenize=True)
/usr/lib/python3.6/codecs.py in decode(self, input, final)
319 # decode input (taking the buffer into account)
320 data = self.buffer + input
--> 321 (result, consumed) = self._buffer_decode(data, self.errors, final)
322 # keep undecoded input until the next call
323 self.buffer = data[consumed:]
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x86 in position 11: invalid start byte
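Since the old links had expired, one possibility is that the downloaded .pkl is actually an HTML error page rather than the binary model, which would explain the decode failure. A minimal sanity check (the toy.pkl path below is a hypothetical stand-in for the downloaded file):

```python
def looks_like_html(path):
    """Return True if the file starts like an HTML page rather than binary data."""
    with open(path, "rb") as f:
        head = f.read(64).lstrip().lower()
    return head.startswith(b"<!doctype") or head.startswith(b"<html")

# Simulate a failed download that saved an error page instead of the model.
with open("toy.pkl", "wb") as f:
    f.write(b"<!DOCTYPE html><html>Not Found</html>")
print(looks_like_html("toy.pkl"))  # True -> re-download from the new URLs
```

If the check returns True for your local file, re-fetching from the dl.fbaipublicfiles.com URLs above should fix it.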
@makcedward I am trying to retrieve similar documents from the given document. Here is the code snippet:
x_train_t = doc2vec_embs.encode(documents=x_train)
x_test_t = doc2vec_embs.encode(documents=x_test)
def similiar_docs(doc2vec_embs, test_sample):
    sims = doc2vec_embs.model.docvecs.most_similar([test_sample], topn=1)
    for s in sims:
        print(x_train[s[0]])
test_sample = x_test_t[0]
print(x_test[0])
similiar_docs(doc2vec_embs, test_sample)
However, the retrieved docs aren't similar. Am I missing something here?
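most_similar ranks the stored training vectors by cosine similarity to the query vector, so one way to sanity-check the encodings is to run that retrieval by hand. A pure-numpy sketch with hypothetical toy embeddings (stand-ins for the output of doc2vec_embs.encode):

```python
import numpy as np

# Hypothetical document embeddings and their source texts.
x_train = ["doc about cats", "doc about stocks", "a mixed doc"]
x_train_t = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])

def similar_docs(train_vecs, test_vec, topn=1):
    # Cosine similarity between the query and every training vector.
    sims = train_vecs @ test_vec / (
        np.linalg.norm(train_vecs, axis=1) * np.linalg.norm(test_vec))
    return [x_train[i] for i in np.argsort(sims)[::-1][:topn]]

print(similar_docs(x_train_t, np.array([0.9, 0.1])))  # ['doc about cats']
```

If a hand check like this looks fine, the mismatch may simply be that Doc2Vec needs more training epochs or data before its neighbours become meaningful.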
Hello,
Please help me understand how to execute "from aion.util.spell_check import SymSpell". I tried using sys and os to fix the import path but got lost.
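One common fix, assuming aion is a package directory inside a local clone of the repository (the ./nlp path below is an assumption; adjust it to wherever you cloned the repo):

```python
import os
import sys

# Assumption: the repository was cloned into ./nlp, so ./nlp/aion exists.
repo_root = os.path.abspath("nlp")
if repo_root not in sys.path:
    sys.path.insert(0, repo_root)  # let Python resolve the aion package

# from aion.util.spell_check import SymSpell  # should now resolve
```

With the clone's root on sys.path, the import in the notebook should work without installing anything.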
How can I get glove.6B.50d.vec, which is imported in sample/nlp-word_mover_distance.ipynb of this repository?
I can't find or use the exact file that you have used. Any idea on that?
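glove.6B.50d.vec isn't distributed directly; the usual route is to download glove.6B.zip from the Stanford NLP site (it contains glove.6B.50d.txt) and prepend the word2vec-style "vocab_size dim" header, which is all the .vec format adds. A minimal sketch of that conversion, demonstrated on a tiny fake GloVe file (gensim's glove2word2vec script does the same thing):

```python
def glove_to_vec(glove_path, vec_path, dim):
    """Prepend the "<vocab_size> <dim>" header that word2vec-format loaders expect."""
    with open(glove_path, encoding="utf-8") as f:
        lines = f.readlines()
    with open(vec_path, "w", encoding="utf-8") as f:
        f.write(f"{len(lines)} {dim}\n")
        f.writelines(lines)

# Demo with a tiny fake GloVe file (two words, two dimensions).
with open("toy_glove.txt", "w", encoding="utf-8") as f:
    f.write("cat 0.1 0.2\ndog 0.3 0.4\n")
glove_to_vec("toy_glove.txt", "toy.vec", dim=2)
print(open("toy.vec", encoding="utf-8").readline().strip())  # 2 2
```

Run the same function on glove.6B.50d.txt with dim=50 to produce the .vec file the notebook expects.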
Hi, thank you for the awesome code on the character embedding model, I had a lot of fun playing with it.
One little suggestion on the code: in CharCNN's build_char_dictionary, you use chars = list(set(chars)). This scrambles the character order in char_dict: every time I start a new notebook, the characters come out in a different order, which results in a different dictionary. What happened to me is that I tried to load my trained Keras model in a new notebook and found that it wasn't working. In the end I figured out that the char_indices from the preprocessing step were totally different from the old ones. I hadn't saved the old char_indices, so I had no choice but to retrain the model, lol.
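Sorting the de-duplicated characters would make the mapping deterministic across sessions. A minimal sketch (the function name mirrors the one in the issue; its exact shape in the repository is an assumption):

```python
def build_char_dictionary(texts):
    # sorted() instead of bare set() -> stable order across runs and notebooks.
    chars = sorted(set("".join(texts)))
    return {c: i + 1 for i, c in enumerate(chars)}  # 0 reserved for padding

d1 = build_char_dictionary(["abc", "cab"])
d2 = build_char_dictionary(["cab", "abc"])
print(d1 == d2)  # True: same dictionary regardless of input order
```

Saving char_indices alongside the trained model is still a good belt-and-braces measure, but with a sorted build it is no longer strictly required.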
explainer = shap.DeepExplainer(pipeline.model, encoded_x_train[:10])
shap_values = explainer.shap_values(encoded_x_test[:1])
x_test_words = prepare_explanation_words(pipeline, encoded_x_test)
y_pred = pipeline.predict(x_test[:1])
print('Actual Category: %s, Predict Category: %s' % (y_test[0], y_pred[0]))
shap.force_plot(explainer.expected_value[0], shap_values[0][0], x_test_words[0])
RETURNS:
ValueError: Dimensions must be equal, but are 10 and 100 for '{{node gradient_tape/functional_1/global_max_pooling1d/truediv_1}} = RealDiv[T=DT_FLOAT](gradient_tape/functional_1/global_max_pooling1d/sub_1, gradient_tape/functional_1/global_max_pooling1d/sub)' with input shapes: [10,512], [10,100,512].
Hello Edward,
Thank you for the great article and detailed blog post about ELMo. I had previously used one of the resources you posted for ELMo in Keras; that implementation lacked some things, like the ability to normalize vectors, so your implementation is certainly better than the one in the resources.
When I try to import aion, I get an error that says "ModuleNotFoundError: No module named 'aion'". I know this is a fairly common packaging issue, but I wasn't able to resolve it, and I couldn't find a reliable online source to fix it either. Please let me know if you have faced this issue at all; any pointers will be appreciated.
I was able to successfully download aion by "pip install aion"
I'm using:
Python 3.6.7
Windows 10
Thanks
The cosine similarity code has a typographic error: it reads transfaormed_results[i] and should be transformed_results[i].