aerdem4 / kaggle-quora-dup
Solution to Kaggle's Quora Duplicate Question Detection Competition
License: MIT License
I have a doubt about this line:
https://github.com/aerdem4/kaggle-quora-dup/blob/master/non_nlp_feature_extraction.py#L41
Perhaps this line should be something like:

```python
return dict(zip(df_output["qid"], df_output["kcore"]))
```

so that the returned dict properly has qids as keys.
(Your code is very helpful. Thank you very much!)
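The difference can be illustrated with a tiny sketch (made-up qids): `to_dict()["kcore"]` keys the result by the DataFrame's positional index, while zipping the two columns keys it by qid.

```python
import pandas as pd

# Hypothetical data: three question ids with their kcore values.
df_output = pd.DataFrame({"qid": [101, 102, 103], "kcore": [2, 0, 3]})

# to_dict() nests by column, then by the positional index,
# so the keys here are 0, 1, 2 rather than the qids.
print(df_output.to_dict()["kcore"])                     # {0: 2, 1: 0, 2: 3}

# Zipping the two columns keys the dict by qid instead.
print(dict(zip(df_output["qid"], df_output["kcore"])))  # {101: 2, 102: 0, 103: 3}
```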
Hi, I downloaded your code and tried to play with it. When I run the script non_nlp_feature_extraction.py, I encounter the following error. I am not familiar with the graph library you used here. Could you please take a look and tell me what's wrong with the code?
```
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-6-df736cd8be48> in <module>()
     10 print("Calculating kcore features...")
     11 all_df = pd.concat([train_df, test_df])
---> 12 kcore_dict = get_kcore_dict(all_df)
     13 train_df = get_kcore_features(train_df, kcore_dict)
     14 test_df = get_kcore_features(test_df, kcore_dict)

<ipython-input-5-4e9e86a38b2a> in get_kcore_dict(df)
     25     print(type(g.nodes()))
     26     print(g.nodes())
---> 27     df_output = pd.DataFrame(data=g.nodes(), columns=["qid"])
     28     df_output["kcore"] = 0
     29     for k in range(2, NB_CORES + 1):

D:\anaconda3\lib\site-packages\pandas\core\frame.py in __init__(self, data, index, columns, dtype, copy)
    352                                          copy=False)
    353         else:
--> 354             raise ValueError('DataFrame constructor not properly called!')
    355
    356         NDFrame.__init__(self, mgr, fastpath=True)

ValueError: DataFrame constructor not properly called!
```
I think the problem is with this function.
```python
def get_kcore_dict(df):
    g = nx.Graph()
    g.add_nodes_from(df.qid1)
    edges = list(df[["qid1", "qid2"]].to_records(index=False))
    g.add_edges_from(edges)
    g.remove_edges_from(g.selfloop_edges())
    print(type(g.nodes()))
    print(g.nodes())
    df_output = pd.DataFrame(data=g.nodes(), columns=["qid"])  # <==== THIS LINE
    df_output["kcore"] = 0
    for k in range(2, NB_CORES + 1):
        ck = nx.k_core(g, k=k).nodes()
        print("kcore", k)
        df_output.ix[df_output.qid.isin(ck), "kcore"] = k
    return df_output.to_dict()["kcore"]
```
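For what it's worth, this error is consistent with NetworkX 2.x, where `g.nodes()` returns a `NodeView` rather than a list, which older pandas versions reject in the `DataFrame` constructor. A minimal sketch of a possible fix, with made-up edges:

```python
import networkx as nx
import pandas as pd

# Tiny graph with made-up question ids, including one self-loop.
g = nx.Graph()
g.add_edges_from([(1, 2), (2, 3), (3, 3)])

# In NetworkX >= 2.4, Graph.selfloop_edges() was removed; the
# module-level nx.selfloop_edges() works across 2.x versions.
g.remove_edges_from(list(nx.selfloop_edges(g)))

# Wrapping the NodeView in list() keeps the DataFrame constructor
# happy on both old and new pandas versions.
df_output = pd.DataFrame(data=list(g.nodes()), columns=["qid"])
print(df_output["qid"].tolist())  # [1, 2, 3]
```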
I am trying to run model.py, but I am getting the following error:
```
D:\imad_web\kaggle-quora-dup_24_position>python model.py
C:\ProgramData\Anaconda3\lib\site-packages\h5py\__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
  from ._conv import register_converters as _register_converters
Using TensorFlow backend.
Creating the vocabulary of words occurred more than 100
Traceback (most recent call last):
  File "model.py", line 122, in <module>
    embeddings_index = get_embedding()
  File "model.py", line 55, in get_embedding
    for line in f:
  File "C:\ProgramData\Anaconda3\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 962: character maps to <undefined>
```
```python
def get_embedding():
    embeddings_index = {}
    f = open(EMBEDDING_FILE)
    for line in f:  # line 55
        values = line.split()
        word = values[0]
        if len(values) == EMBEDDING_DIM + 1 and word in top_words:
            coefs = np.asarray(values[1:], dtype="float32")
            embeddings_index[word] = coefs
    f.close()
    return embeddings_index
```
```python
vectorizer = CountVectorizer(lowercase=False, token_pattern="\S+", min_df=MIN_WORD_OCCURRENCE)
vectorizer.fit(all_questions)
top_words = set(vectorizer.vocabulary_.keys())
top_words.add(REPLACE_WORD)
embeddings_index = get_embedding()  # line 122
print("Words are not found in the embedding:", top_words - embeddings_index.keys())
top_words = embeddings_index.keys()
```
Hello!
First of all, thank you very much for open-sourcing this.
I am very interested in your implementation, so I downloaded your code and ran it on my machine (GPU: Titan Xp).
After running model.py with epochs=15, I got a validation loss of about 0.203 (training loss about 0.17). The training results don't seem very good.
I see you ranked 23rd on Kaggle with a leaderboard score of 0.12988, so I would like to ask: what was your offline score? How can I use this open-source code to reach the same validation loss as you?
Looking forward to your reply. Thanks again!
@aerdem4

```python
preds = model.predict([test_data_1, test_data_2, features_test], batch_size=BATCH_SIZE, verbose=1)
```

I am getting this error:

```
ValueError: Error when checking : expected input_3 to have shape (None, 24) but got array with shape (200001, 31)
```
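The mismatch means the model's third input (the engineered features) was defined with 24 columns at training time, but the test feature matrix has 31. A minimal sketch of the situation (variable names and shapes hypothetical, taken from the error message):

```python
import numpy as np

# Hypothetical feature matrices reproducing the reported mismatch: the
# input_3 layer's width was fixed from the training features.
features_train = np.zeros((1000, 24))
features_test = np.zeros((200001, 31))

# Keras validates the trailing dimension at predict time, so the usual
# fix is to regenerate the features so that train and test share
# exactly the same feature columns.
print(features_train.shape[1] == features_test.shape[1])  # False
```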
Hi, I was interested in playing around with this model, but I was wondering: instead of renting an AWS server to train it, would it be possible to include some trained weights so we can use the model without training it?
Thanks
May I ask whether the model in your solution is based on any other reference, such as a paper? Thanks!
How do I get past this? Can somebody help me?