
wiki-sem-500's Introduction

This repository contains the WikiSem500 dataset described in "Automated Generation of Multilingual Clusters for the Evaluation of Distributed Representations" by Philip Blair, Yuval Merhav, and Joel Barry.

The test groups themselves can be found in wiki-sem-500.tar.gz (wiki-sem-500-tokenized.tar.gz is pre-tokenized). The structure of the archive is as follows:

wiki-sem-500
├── de
│   ├── Q101352.txt
│   ├── Q105000.txt
│   ├── Q1061151.txt
│   ├── Q1065118.txt
│   ...
├── en
│   ├── Q101352.txt
│   ...
├── es
│   ├── Q101352.txt
│   ...
├── ja
│   ├── Q101352.txt
│   ...
├── zh
│   ├── Q101352.txt
│   ...

Note that although many classes are available in several languages, many others exist in only some of them.
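
To check which classes are shared across languages, the directory layout above can be scanned directly. The following is a minimal sketch, assuming the archive has been unpacked and that a class is identified by its file name without the extension (for example Q101352). The function names are illustrative and are not part of this repository.

```python
from collections import defaultdict
from pathlib import Path


def classes_by_language(root):
    """Map each class ID (the file stem) to the set of language directories containing it."""
    coverage = defaultdict(set)
    for lang_dir in sorted(Path(root).iterdir()):
        if not lang_dir.is_dir():
            continue
        for group_file in lang_dir.glob("Q*.txt"):
            coverage[group_file.stem].add(lang_dir.name)
    return dict(coverage)


def classes_in_all(coverage, languages):
    """Return the sorted class IDs that are present in every requested language."""
    wanted = set(languages)
    return sorted(cid for cid, langs in coverage.items() if wanted <= langs)
```

For instance, `classes_in_all(classes_by_language("wiki-sem-500"), ["en", "de"])` would list the classes usable for an English/German comparison.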

Each file contains a cluster, followed by a sequence of one or more outliers:

$ cat en/Q1060829.txt

Madison_Square_Garden
Walt_Disney_Concert_Hall
Olympia
Kodak_Theatre
Carnegie_Hall
Auditorio_de_Tenerife
Royal_Albert_Hall
Palau_de_la_Música_Catalana

CBGB
Buena_Vista_Social_Club
Arena_di_Verona
Barbican_Centre
RMS
HMHS
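
A file in this format can be read with a few lines of Python. This is a rough sketch, assuming (based on the example above) that the cluster and the outliers are separated by a blank line and that leading or trailing blank lines carry no meaning. `parse_test_group` is a hypothetical helper, not a function provided by this repository.

```python
def parse_test_group(text):
    """Split the text of a test-group file into (cluster, outliers)."""
    blocks = []
    current = []
    for line in text.splitlines():
        entry = line.strip()
        if entry:
            current.append(entry)
        elif current:
            # A blank line closes the block collected so far.
            blocks.append(current)
            current = []
    if current:
        blocks.append(current)
    if len(blocks) != 2:
        raise ValueError(f"expected 2 blank-line-separated blocks, found {len(blocks)}")
    cluster, outliers = blocks
    return cluster, outliers


# Abbreviated copy of the example file shown above.
sample = """
Madison_Square_Garden
Carnegie_Hall
Royal_Albert_Hall

CBGB
RMS
"""
cluster, outliers = parse_test_group(sample)
print("cluster:", cluster)
print("outliers:", outliers)
```

To parse a real file, pass its contents, for example `parse_test_group(open("en/Q1060829.txt", encoding="utf-8").read())`.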

Running the Evaluation Script

To run the evaluation script, navigate to this directory inside a virtualenv and run install_dependencies.py. Embedding loading is handled by a partial fork of polyglot.

Once the dependencies are installed, unpack the tokenized dataset at a location of your choice (say, dataset/). A word2vec binary embedding can then be evaluated as follows:

(venv2) $ ./evaluate.py -w2v vectors.bin -d dataset/en -b
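
To evaluate several languages in one go, the same command line can be assembled programmatically. The sketch below only builds and prints the commands, using the flags from the invocation above. The per-language embedding file names (`vectors.<lang>.bin`) and the `dataset` root are assumptions made for illustration.

```python
import shlex

# Language directories listed in the archive structure.
LANGUAGES = ["de", "en", "es", "ja", "zh"]


def build_command(embedding_path, dataset_root, lang, binary=True):
    """Build the evaluate.py argument list for one language."""
    cmd = ["./evaluate.py", "-w2v", embedding_path, "-d", f"{dataset_root}/{lang}"]
    if binary:
        cmd.append("-b")
    return cmd


for lang in LANGUAGES:
    # Hypothetical naming scheme: one word2vec binary per language.
    print(shlex.join(build_command(f"vectors.{lang}.bin", "dataset", lang)))
```

Each list can be handed to `subprocess.run(cmd, check=True)` from inside the virtualenv to actually run the evaluation.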

GloVe and Gensim embeddings are also supported. Here is the full help message for evaluate.py:

usage: evaluate.py [-h] (-w2v WORD2VEC | -gv GLOVE | -gs GENSIM) -d DATASET
                   [-b] [-p] [-goog] [-ci CASE_INSENSITIVE]

Scoring script for outlier detection

optional arguments:
  -h, --help            show this help message and exit
  -w2v WORD2VEC, --word2vec WORD2VEC
                        Specify word2vec embedding file
  -gv GLOVE, --glove GLOVE
                        Specify GloVe embedding file
  -gs GENSIM, --gensim GENSIM
                        Specify Gensim embedding file
  -d DATASET, --dataset DATASET
                        Path to outlier dataset
  -b, --binary          Indicates that the embedding file is binary (ignored
                        for GloVe files)
  -p, --phrases         Indicates that the embedding file supports phrases
  -goog, --google-news  Indicates that the embeddings have been normalized in
                        the same fashion as the Google News word2vec
                        embeddings
  -ci CASE_INSENSITIVE, --case-insensitive CASE_INSENSITIVE
                        Indicates whether the embeddings are all lowercased
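
For intuition about what an outlier detection score measures, here is a toy sketch. It is not the logic of evaluate.py: it assumes each outlier is tested separately against the cluster, and it flags as the outlier the word with the lowest mean cosine similarity to the other words. The two-dimensional vectors are made up for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def detect_outlier(words, vectors):
    """Return the word that is, on average, least similar to the others."""
    def mean_similarity(word):
        others = [w for w in words if w != word]
        return sum(cosine(vectors[word], vectors[o]) for o in others) / len(others)
    return min(words, key=mean_similarity)


def accuracy(cluster, outliers, vectors):
    """Fraction of outliers correctly detected when each is added to the cluster alone."""
    hits = sum(detect_outlier(cluster + [o], vectors) == o for o in outliers)
    return hits / len(outliers)


# Toy embeddings (assumption): venue names point one way, the outlier another.
toy_vectors = {
    "Carnegie_Hall": [1.0, 0.1],
    "Royal_Albert_Hall": [0.9, 0.2],
    "Kodak_Theatre": [1.0, 0.0],
    "RMS": [0.0, 1.0],
}
toy_cluster = ["Carnegie_Hall", "Royal_Albert_Hall", "Kodak_Theatre"]
print(detect_outlier(toy_cluster + ["RMS"], toy_vectors))
print(accuracy(toy_cluster, ["RMS"], toy_vectors))
```

With real embeddings, `vectors` would be a lookup from each token to its embedding, and words missing from the vocabulary would need separate handling.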
