quora-miracle-nikki

Kaggle competition for the WSM final project.

Prepare Your Data

Put train.csv and test.csv in the ./data directory.
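A quick way to confirm the data is in place before running any feature script is a small check like the one below. This is only a sketch: the helper name `missing_data_files` is hypothetical and not part of the repository; only the file names and the ./data location come from the instruction above.

```python
from pathlib import Path

# File names taken from the instruction above; the helper itself is hypothetical.
REQUIRED_FILES = ("train.csv", "test.csv")


def missing_data_files(data_dir="./data"):
    """Return the required file names that are not present in data_dir."""
    root = Path(data_dir)
    return [name for name in REQUIRED_FILES if not (root / name).is_file()]


if __name__ == "__main__":
    missing = missing_data_files()
    if missing:
        print("Missing from ./data:", ", ".join(missing))
    else:
        print("Data directory looks complete.")
```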

Dependencies

numpy, pandas, nltk, gensim, fuzzywuzzy, python-Levenshtein, scikit-learn, xgboost. Tested only on Python 3.

File Description

Feature engineering:

preprocess.py: basic statistical features
const.py: dictionaries used for preprocessing
tfidf.py: BM25 cosine similarity
fuzzy.py: fuzzy string similarity
magic.py: features based on duplicated questions
d2v.py: Doc2Vec similarity (did not improve results)
deep_feature.py: Word2Vec features
bm25-word2vec.py: implementation of the paper "Short Text Similarity with Word Embeddings" (CIKM '15)
magic_v2.py: question hash intersection count
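The duplication and intersection features can be illustrated with a small sketch. The code in magic.py and magic_v2.py is not shown here, so the definitions below are assumptions: a question's frequency is taken to be the number of distinct questions it is paired with, and the intersection count is the number of partners the two questions share. The function names `build_neighbors` and `pair_features` are hypothetical; the output keys reuse feature names from the feature list.

```python
from collections import defaultdict


def build_neighbors(pairs):
    """Map each question to the set of questions it is paired with."""
    neighbors = defaultdict(set)
    for q1, q2 in pairs:
        neighbors[q1].add(q2)
        neighbors[q2].add(q1)
    return neighbors


def pair_features(q1, q2, neighbors):
    """Return frequency and intersection counts for one question pair."""
    # .get avoids inserting empty entries for unseen questions.
    n1 = neighbors.get(q1, set())
    n2 = neighbors.get(q2, set())
    return {
        "q1_freq": len(n1),
        "q2_freq": len(n2),
        "q1_q2_intersect": len(n1 & n2),
    }


# Toy pairs; short strings stand in for question texts or their hashes.
pairs = [("a", "b"), ("a", "c"), ("b", "c"), ("d", "a")]
neighbors = build_neighbors(pairs)
result = pair_features("a", "b", neighbors)
print(result)  # {'q1_freq': 3, 'q2_freq': 2, 'q1_q2_intersect': 1}
```

In practice the neighbor map would be built over the question pairs from both train.csv and test.csv, since these features depend on how questions co-occur across the whole dataset.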

Classification model:

predict.py: concatenates the features and trains a model to make predictions
(currently, GBDT in sklearn or xgboost gives the best results)
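The concatenate-then-train step might look roughly like the sketch below. It does not reproduce predict.py: the feature blocks are random stand-ins for the outputs of the feature scripts, the labels are synthetic, and the hyperparameters are arbitrary. It uses scikit-learn's `GradientBoostingClassifier`; an xgboost classifier could be swapped in at the same point.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)

# Two hypothetical feature blocks, standing in for outputs of separate feature scripts.
n_rows = 200
block_a = rng.normal(size=(n_rows, 3))
block_b = rng.normal(size=(n_rows, 2))

# Synthetic labels, for illustration only.
y = (block_a[:, 0] + block_b[:, 1] > 0).astype(int)

# Concatenate the blocks column-wise into one feature matrix.
X = np.hstack([block_a, block_b])

# Train on the first 150 rows and hold out the remaining 50.
model = GradientBoostingClassifier(n_estimators=50, random_state=0)
model.fit(X[:150], y[:150])

# Predicted probability of the positive class for the held-out rows.
proba = model.predict_proba(X[150:])[:, 1]
print(proba[:5])
```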

Feature Description

allow_features = [
    'noun_sub', 'verb_sub', 'keyword_match', 'word_difference',
    'noun_share', 'verb_share', 'keyword_match_ratio', 'bigram_match',
    'bigram_match_ratio', 'bigram_difference', 'trigram_match', 'trigram_match_ratio', 'trigram_difference',
    'bm25_cosine', 'q1_freq', 'q2_freq',
    'qratio', 'wratio', 'ratio', 'partial_ratio', 'partial_token_sort_ratio',
    'token_set_ratio', 'token_sort_ratio',
    'wmd', 'norm_wmd', 'sent2vec_cosine', 'sent2vec_cityblock', 'deep_bm25',
    'q1_q2_intersect', 'q1_q2_wm_ratio',
    'word_match', 'tfidf_wm', 'tfidf_wm_stops', 'jaccard',
    'wc_diff', 'wc_ratio', 'wc_diff_unique', 'wc_ratio_unique',
    'wc_diff_unq_stop', 'wc_ratio_unique_stop', 'same_start_word',
    'char_diff', 'char_diff_unq_stop', 'total_unique_words',
    'total_unq_words_stop', 'char_ratio', 'q_type1', 'q_type2',
    'q1_pagerank', 'q2_pagerank',
    'glove_wmd', 'glove_norm_wmd', 'glove_sent2vec_cosine', 'glove_sent2vec_cityblock',
    'fasttext_wmd', 'fasttext_norm_wmd', 'fasttext_sent2vec_cosine', 'fasttext_sent2vec_cityblock',
    'glove_deep_bm25', 'fasttext_deep_bm25',
]
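To give a feel for the simpler entries in this list, here is a sketch of four of them. The feature names come from the list above, but their exact definitions in the repository are not documented here, so the formulas below (whitespace tokenization, absolute differences, Jaccard similarity over unique lowercase tokens) are assumptions, and `simple_pair_features` is a hypothetical helper.

```python
def tokens(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def simple_pair_features(q1, q2):
    """Compute a few assumed definitions of the simpler pair features."""
    t1, t2 = tokens(q1), tokens(q2)
    s1, s2 = set(t1), set(t2)
    union = s1 | s2
    return {
        # Shared unique words divided by all unique words.
        "jaccard": len(s1 & s2) / len(union) if union else 0.0,
        # Absolute difference in word counts.
        "wc_diff": abs(len(t1) - len(t2)),
        # Absolute difference in character counts.
        "char_diff": abs(len(q1) - len(q2)),
        # 1 if both questions start with the same word, else 0.
        "same_start_word": int(bool(t1) and bool(t2) and t1[0] == t2[0]),
    }


features = simple_pair_features(
    "How do I learn Python", "How can I learn Python fast"
)
print(features)
```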

Get Deeper

Download GoogleNews-vectors-negative300.bin.gz from the official word2vec website on Google, then set the model_path variable in deep_feature.py to the location of the downloaded file.
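Loading the downloaded vectors with gensim typically looks like the sketch below. How deep_feature.py actually loads them is not shown here; the path value and the `load_word2vec` wrapper are hypothetical, while `KeyedVectors.load_word2vec_format` with `binary=True` is gensim's loader for the binary word2vec format.

```python
from pathlib import Path

# Hypothetical location; point this at wherever the download was saved.
model_path = "./data/GoogleNews-vectors-negative300.bin.gz"


def load_word2vec(path):
    """Load binary word2vec vectors, failing early if the file is absent."""
    if not Path(path).is_file():
        raise FileNotFoundError(f"word2vec file not found: {path}")
    # Imported lazily so the path check works even before gensim is installed.
    from gensim.models import KeyedVectors

    return KeyedVectors.load_word2vec_format(path, binary=True)
```

Calling `load_word2vec(model_path)` returns a `KeyedVectors` object. The file is large, so loading can take a while and needs several gigabytes of memory.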

Contributors

lonsilent
