aerdem4 / kaggle-quora-dup
Solution to Kaggle's Quora Duplicate Question Detection Competition
License: MIT License
I have a doubt about this line:
https://github.com/aerdem4/kaggle-quora-dup/blob/master/non_nlp_feature_extraction.py#L41
Perhaps this line should be something like:

```python
return dict(zip(df_output["qid"], df_output["kcore"]))
```

so that the returned dict properly has qids as keys.
(Your code is very helpful. Thank you very much!)
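The difference can be illustrated with a tiny sketch (made-up qids): `to_dict()["kcore"]` keys the result by the DataFrame's positional index, while zipping the two columns keys it by qid.

```python
import pandas as pd

# Hypothetical data: three question ids with their kcore values.
df_output = pd.DataFrame({"qid": [101, 102, 103], "kcore": [2, 0, 3]})

# to_dict() nests by column, then by the positional index,
# so the keys here are 0, 1, 2 rather than the qids.
print(df_output.to_dict()["kcore"])                     # {0: 2, 1: 0, 2: 3}

# Zipping the two columns keys the dict by qid instead.
print(dict(zip(df_output["qid"], df_output["kcore"])))  # {101: 2, 102: 0, 103: 3}
```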
Hi, I downloaded your code and tried to play with it. When I run the script non_nlp_feature_extraction.py, I encounter the following error. I am not familiar with the graph library you used here. Could you please take a look and tell me what's wrong with the code?
```
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-6-df736cd8be48> in <module>()
     10 print("Calculating kcore features...")
     11 all_df = pd.concat([train_df, test_df])
---> 12 kcore_dict = get_kcore_dict(all_df)
     13 train_df = get_kcore_features(train_df, kcore_dict)
     14 test_df = get_kcore_features(test_df, kcore_dict)

<ipython-input-5-4e9e86a38b2a> in get_kcore_dict(df)
     25     print(type(g.nodes()))
     26     print(g.nodes())
---> 27     df_output = pd.DataFrame(data=g.nodes(), columns=["qid"])
     28     df_output["kcore"] = 0
     29     for k in range(2, NB_CORES + 1):

D:\anaconda3\lib\site-packages\pandas\core\frame.py in __init__(self, data, index, columns, dtype, copy)
    352                                          copy=False)
    353         else:
--> 354             raise ValueError('DataFrame constructor not properly called!')
    355
    356         NDFrame.__init__(self, mgr, fastpath=True)

ValueError: DataFrame constructor not properly called!
```
I think the problem is with this function.
```python
def get_kcore_dict(df):
    g = nx.Graph()
    g.add_nodes_from(df.qid1)
    edges = list(df[["qid1", "qid2"]].to_records(index=False))
    g.add_edges_from(edges)
    g.remove_edges_from(g.selfloop_edges())
    print(type(g.nodes()))
    print(g.nodes())
    df_output = pd.DataFrame(data=g.nodes(), columns=["qid"])  # <==== THIS LINE
    df_output["kcore"] = 0
    for k in range(2, NB_CORES + 1):
        ck = nx.k_core(g, k=k).nodes()
        print("kcore", k)
        df_output.ix[df_output.qid.isin(ck), "kcore"] = k
    return df_output.to_dict()["kcore"]
```
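For what it's worth, this error is consistent with NetworkX 2.x, where `g.nodes()` returns a `NodeView` rather than a list, which older pandas versions reject in the `DataFrame` constructor. A minimal sketch of a possible fix, with made-up edges:

```python
import networkx as nx
import pandas as pd

# Tiny graph with made-up question ids, including one self-loop.
g = nx.Graph()
g.add_edges_from([(1, 2), (2, 3), (3, 3)])

# In NetworkX >= 2.4, Graph.selfloop_edges() was removed; the
# module-level nx.selfloop_edges() works across 2.x versions.
g.remove_edges_from(list(nx.selfloop_edges(g)))

# Wrapping the NodeView in list() keeps the DataFrame constructor
# happy on both old and new pandas versions.
df_output = pd.DataFrame(data=list(g.nodes()), columns=["qid"])
print(df_output["qid"].tolist())  # [1, 2, 3]
```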
I am trying to run model.py, but I am getting the following error:
```
D:\imad_web\kaggle-quora-dup_24_position>python model.py
C:\ProgramData\Anaconda3\lib\site-packages\h5py\__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
  from ._conv import register_converters as _register_converters
Using TensorFlow backend.
Creating the vocabulary of words occurred more than 100
Traceback (most recent call last):
  File "model.py", line 122, in <module>
    embeddings_index = get_embedding()
  File "model.py", line 55, in get_embedding
    for line in f:
  File "C:\ProgramData\Anaconda3\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 962: character maps to <undefined>
```
```python
def get_embedding():
    embeddings_index = {}
    f = open(EMBEDDING_FILE)
    for line in f:  # line 55
        values = line.split()
        word = values[0]
        if len(values) == EMBEDDING_DIM + 1 and word in top_words:
            coefs = np.asarray(values[1:], dtype="float32")
            embeddings_index[word] = coefs
    f.close()
    return embeddings_index
```
```python
vectorizer = CountVectorizer(lowercase=False, token_pattern="\S+", min_df=MIN_WORD_OCCURRENCE)
vectorizer.fit(all_questions)
top_words = set(vectorizer.vocabulary_.keys())
top_words.add(REPLACE_WORD)
embeddings_index = get_embedding()  # line 122
print("Words are not found in the embedding:", top_words - embeddings_index.keys())
top_words = embeddings_index.keys()
```
Hello!
First of all, thank you very much for open-sourcing this.
I am very interested in your implementation, so I downloaded your code and ran it on my machine (GPU: Titan Xp).
After running model.py with epochs=15, I got a validation loss of about 0.203 (training loss about 0.17). The training results don't seem very good.
I see you ranked 23rd on Kaggle with a leaderboard score of 0.12988, so I would like to ask: what was your offline score? How can I use this open-source code to reach the same validation loss as you?
Looking forward to your reply. Thanks again!
@aerdem4

```python
preds = model.predict([test_data_1, test_data_2, features_test], batch_size=BATCH_SIZE, verbose=1)
```

I am getting this error:

```
ValueError: Error when checking : expected input_3 to have shape (None, 24) but got array with shape (200001, 31)
```
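The mismatch means the model's third input (the engineered features) was defined with 24 columns at training time, but the test feature matrix has 31. A minimal sketch of the situation (variable names and shapes hypothetical, taken from the error message):

```python
import numpy as np

# Hypothetical feature matrices reproducing the reported mismatch: the
# input_3 layer's width was fixed from the training features.
features_train = np.zeros((1000, 24))
features_test = np.zeros((200001, 31))

# Keras validates the trailing dimension at predict time, so the usual
# fix is to regenerate the features so that train and test share
# exactly the same feature columns.
print(features_train.shape[1] == features_test.shape[1])  # False
```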
Hi, I was interested in playing around with this model, but I was wondering: instead of renting an AWS server to train it, would it be possible to include some trained weights so we can use the model without training it?
Thanks
May I ask whether the model in your solution is based on any other reference, such as a paper? Thanks!
How do I get past this? Can somebody help me?