williamleif / histwords

411 stars, 24 watchers, 92 forks, 1.02 MB

Collection of tools for building diachronic/historical word vectors

Home Page: http://nlp.stanford.edu/projects/histwords/

License: Apache License 2.0

Python 81.94% Shell 4.21% C 0.41% Makefile 0.53% JavaScript 9.76% HTML 0.51% CSS 2.64%

histwords's Introduction

Word Embeddings for Historical Text

Author: William Hamilton ([email protected])

Overview

An eclectic collection of tools for analyzing historical language change using vector space semantics.

[Figure: two-dimensional visualizations of historical semantic change (e.g., 'gay', 'broadcast', 'awful' across decades), from the project website]

Pre-trained historical embeddings

Various embeddings (for many languages and using different embedding approaches) are available on the project website.

Some pre-trained word2vec (i.e., SGNS) historical word vectors for multiple languages (constructed via Google N-grams) are also available here:

All except Chinese contain embeddings for the decades in the range 1800s-1990s (2000s are excluded because of sampling changes in the N-grams corpus). The Chinese data starts in 1950.

Embeddings constructed using the Corpus of Historical American English (COHA) are also available:

example.sh contains an example run, showing how to download and use the embeddings. example.py shows how to use the vector representations in Python (assuming you have already run the example.sh script).

This paper describes how the embeddings were constructed. If you make use of these embeddings in your research, please cite the following:

@inproceedings{hamilton_diachronic_2016,
  title = {Diachronic {Word} {Embeddings} {Reveal} {Statistical} {Laws} of {Semantic} {Change}},
  url = {http://arxiv.org/abs/1605.09096},
  booktitle = {Proc. {Assoc}. {Comput}. {Ling}. ({ACL})},
  author = {Hamilton, William L. and Leskovec, Jure and Jurafsky, Dan},
  year = {2016}
}

Training your own embeddings

You can use the provided code to train your own embeddings (see the code organization below). However, thanks to Ryan Heuser, you can also simply train embeddings with gensim (https://radimrehurek.com/gensim/) and use Ryan's port of my code to align the gensim models between time periods (https://gist.github.com/quadrismegistus/09a93e219a6ffc4f216fb85235535faf). Gensim contains many easy-to-use variants of word embeddings (e.g., LSI/SVD, word2vec, wordrank, ...), provides wrappers for other packages like GloVe, and is very well maintained, so this route is recommended.
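As a rough illustration of that gensim route, here is a minimal sketch that trains one SGNS model per decade. The corpus paths are hypothetical, the hyperparameters are illustrative, and the keyword names assume gensim 4.x (earlier versions use size instead of vector_size):

    # Sketch: train one word2vec (SGNS) model per decade with gensim.
    # Assumes hypothetical files corpus/1900.txt etc., one sentence per line.
    from gensim.models import Word2Vec
    from gensim.models.word2vec import LineSentence

    models = {}
    for decade in range(1900, 2000, 10):
        sentences = LineSentence("corpus/%d.txt" % decade)
        # sg=1 selects skip-gram with negative sampling (SGNS).
        models[decade] = Word2Vec(sentences, vector_size=300, window=4,
                                  min_count=100, sg=1, negative=5)
        models[decade].save("embeddings/%d.w2v" % decade)

The per-decade models can then be aligned pairwise with the alignment function from Ryan's gist linked above.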

Code organization

The structure of the code (in terms of folder organization) is as follows:

Main folder for using historical embeddings:

Folders with pre-processing code and active research code (potentially unstable):

example.py shows how to compute the similarity series for two words over time, which is how we evaluated different methods against the attested semantic shifts listed in our paper.
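As a library-free illustration of the same computation, here is a numpy sketch; it assumes the downloadable SGNS archives use a per-decade <year>-w.npy matrix plus a pickled <year>-vocab.pkl word list, which is an assumption you should verify against your download:

    # Sketch: cosine-similarity series for a word pair across decades.
    # ASSUMED layout: <dir>/<year>-w.npy and <dir>/<year>-vocab.pkl;
    # on Python 3, old pickles may need pickle.load(f, encoding="latin-1").
    import pickle
    import numpy as np

    def similarity_series(emb_dir, w1, w2, years):
        sims = {}
        for year in years:
            mat = np.load("%s/%d-w.npy" % (emb_dir, year))
            with open("%s/%d-vocab.pkl" % (emb_dir, year), "rb") as f:
                vocab = pickle.load(f)
            index = {w: i for i, w in enumerate(vocab)}
            v1, v2 = mat[index[w1]], mat[index[w2]]
            denom = np.linalg.norm(v1) * np.linalg.norm(v2)
            sims[year] = float(v1.dot(v2) / denom) if denom else 0.0
        return sims

    print(similarity_series("embeddings/eng-all_sgns", "gay", "cheerful",
                            range(1850, 2000, 10)))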

If you want to learn historical embeddings for new data, the code in the sgns directory is recommended and can be run with the default settings. As long as your corpus has at least 100 million words per time period, this is the best method. For smaller corpora, use the representations/ppmigen.py code followed by vecanalysis/makelowdim.py (to learn SVD embeddings). In either case, the vecanalysis/seq_procrustes.py code should be used to align the learned embeddings. The default hyperparameters should suffice for most use cases.
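The alignment step is orthogonal Procrustes: find the rotation that maps one decade's embedding matrix onto the next while preserving all pairwise cosine similarities. Here is a minimal numpy sketch of the core computation only, not the seq_procrustes.py interface, which also handles vocabulary intersection and I/O:

    # Sketch: orthogonal Procrustes alignment of two embedding matrices.
    # `base` and `other` must already be row-aligned on a shared vocabulary.
    import numpy as np

    def procrustes_align(base, other):
        # SVD of the cross-covariance gives the optimal rotation R = U V^T,
        # minimizing ||other @ R - base||_F over orthogonal matrices R.
        u, _, vt = np.linalg.svd(other.T.dot(base))
        return other.dot(u.dot(vt))

Because R is orthogonal, the aligned vectors keep their lengths and mutual angles; only the coordinate system changes.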

However, as a caveat to the above, the code is somewhat messy, unstable, and specific to the historical corpora that it was originally designed for. If you are looking for a nice, off-the-shelf toolbox to run word2vec, I recommend you check out gensim.

Dependencies

Core dependencies:

You will also need Jupyter/IPython to run any IPython notebooks.

histwords's People

Contributors

williamleif


histwords's Issues

code for visualizing results

very cool work!

  1. I have been looking for the code that uses t-SNE to create the visualizations on the front page of the git repo, but can't find it. I've grepped for usage of sklearn (and for the 'manifold' / 'tsne' keywords) but only found sklearn's normalization in use. Is the visualization code in the repo?

  2. The visualization on the front page shows 'broadcast 1900s' twice (in the middle panel). Is that intentional?
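For reference, the usual recipe for plots like these is to collect a target word's vectors from several decades (plus nearest neighbors) and project them to 2-D with scikit-learn's t-SNE. A minimal sketch with placeholder data, since the original plotting code does not appear to be in the repo:

    # Sketch: 2-D t-SNE projection of one word's vectors across decades.
    # Random vectors stand in for real embeddings to keep this self-contained.
    import numpy as np
    from sklearn.manifold import TSNE

    rng = np.random.RandomState(0)
    vectors = {"broadcast_%d" % y: rng.randn(300) for y in range(1850, 2000, 10)}

    labels = sorted(vectors)
    matrix = np.vstack([vectors[l] for l in labels])
    coords = TSNE(n_components=2, perplexity=5, random_state=0).fit_transform(matrix)
    for label, (x, y) in zip(labels, coords):
        print("%-16s %8.2f %8.2f" % (label, x, y))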

Problem installing, seemingly due to (missing?) folder "cooccurrence"

bash-4.3$ pip install --user git+https://github.com/williamleif/histwords.git
You are using pip version 6.0.8, however version 8.1.2 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.
Collecting git+https://github.com/williamleif/histwords.git
  Cloning https://github.com/williamleif/histwords.git to /tmp/pip-GQOH9S-build
Traceback (most recent call last):
  File "<string>", line 20, in <module>
  File "/tmp/pip-GQOH9S-build/setup.py", line 7, in <module>
    ext_modules = cythonize(["googlengram/pullscripts/*.pyx", "cooccurrence/*.pyx"]),
  File "/usr/lib64/python2.7/site-packages/Cython/Build/Dependencies.py", line 754, in cythonize
    aliases=aliases)
  File "/usr/lib64/python2.7/site-packages/Cython/Build/Dependencies.py", line 649, in create_extension_list
    for file in nonempty(extended_iglob(filepattern), "'%s' doesn't match any files" % filepattern):
  File "/usr/lib64/python2.7/site-packages/Cython/Build/Dependencies.py", line 103, in nonempty
    raise ValueError(error_msg)
ValueError: 'cooccurrence/*.pyx' doesn't match any files
----------------------------------------
Command "python setup.py egg_info" failed with error code 1 in /tmp/pip-GQOH9S-build
bash-4.3$

wrong setup information in "setup.py"

Existing "setup.py" file could not enable successful installation when using "pip install", for the value "sklearn" in "install_requires" is no longer available for installation. The correct value should be "scikit-learn". Installation successes after this change.

Could you provide embeddings that are not normalised?

I noticed that the pre-trained embeddings available for download are all normalised, with each vector's L2 norm equal to 1. Since normalisation causes information loss, could you provide the original (unnormalised) embeddings? Thanks.

Difficulties to use seq_procrustes.py with new embeddings

Hi,

I am currently experiencing some difficulties generating new embeddings with your code for visualizing words over time.

For now, I have generated separate embeddings by year using sgns/hyperwords. That seems to be OK.

I am now trying to use your script vecanalysis/seq_procrustes.py, but I think I am not using the correct format for the required count file: I suppose it is not the same as the one generated by hyperwords? Maybe I missed it, but is there an example of this file somewhere?

I downloaded the example embeddings/eng-fiction-all_sgns for visualisation (and it works), but could not find any count file.

Thank you for the answer.

Best regards.

Array multiplication or matrix multiplication?

On lines 108 and 110 of histwords/representations/embedding.py, it seems that you use elementwise array multiplication instead of matrix multiplication. Is this intended? Usually s and u are combined via matrix multiplication, I think.
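For context, if s is the 1-D array of singular values, numpy broadcasting makes the elementwise product u * s scale column j of u by s[j], which is exactly the matrix product u @ diag(s), so the two forms coincide. A quick check:

    # Elementwise u * s (broadcasting) vs. matrix product u @ diag(s):
    # for a 1-D s, both scale the columns of u identically.
    import numpy as np

    rng = np.random.RandomState(0)
    u = rng.randn(5, 3)   # e.g. left singular vectors (5 x 3)
    s = rng.rand(3)       # singular values as a 1-D array
    print(np.allclose(u * s, u.dot(np.diag(s))))  # True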

SGNS results

Hi,

I have a problem re-generating the SGNS embeddings on the Google N-grams corpus.

I follow these steps:

  1. use histwords/googlengram/pullscripts/posgrab.py to generate counts for 1-gram
  2. use histwords/googlengram/pullscripts/downloadandsplit.py then histwords/googlengram/pullscripts/gramgrab.py (set context to 4)
  3. use histwords/googlengram/pullscripts/runmerge.py on the output from 2 and then histwords/googlengram/pullscripts/indexmerge.py
  4. use histwords/googlengram/freqperyear.py on the output of 3
  5. use histwords/googlengram/makedecades.py on the output of 3
  6. use histwords/sgns/makecorpus.py by passing the output of 1, 4 and 5
  7. train embeddings using histwords/sgns/runword2vec.py (using --sequential option)
  8. use histwords/sgns/postprocessingsgns.py on the trained data.

My problem is that the generated vectors are not the same as the pre-trained vectors at http://snap.stanford.edu/historical_embeddings/eng-all_sgns.zip: my vocabulary size is about 50,000 while yours is about 100,000.

So my question is: is anything wrong in the steps I followed? Or can you give me any information on why this happens?

Thanks,
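One quick way to check a discrepancy like this is to count the vocabulary in each embedding set directly, assuming the pickled <year>-vocab.pkl word lists mentioned in the sketches above; the paths here are hypothetical:

    # Sketch: compare vocabulary sizes between two embedding directories.
    # ASSUMES pickled <year>-vocab.pkl word lists; paths are hypothetical.
    import pickle

    for path in ["eng-all_sgns/1990-vocab.pkl", "my-output/1990-vocab.pkl"]:
        with open(path, "rb") as f:
            print(path, len(pickle.load(f)))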

Zero-valued vectors?

Regarding the pre-trained vectors for some of the corpora: (on the HistWords website)

For specific decades, there appear to be a handful of word vectors that are 0.0 across all 300 dimensions, even though the corresponding words are still present in the corpus for that decade.

These words do not get any sort of representation and have been assigned zero values throughout. For example, the vector for the word 'autism' from the 1800s decade of the Google N-grams eng-all vectors is [0.0 ... 0.0] across all 300 dimensions.

Would it be apt to treat these words as simply 'missing' from the corpus in that decade?
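Such rows are straightforward to detect, so they can be filtered out or treated as missing explicitly; a minimal numpy sketch, again assuming the <year>-w.npy / <year>-vocab.pkl layout:

    # Sketch: list words whose vector is all zeros in a given decade.
    import pickle
    import numpy as np

    mat = np.load("eng-all_sgns/1800-w.npy")
    with open("eng-all_sgns/1800-vocab.pkl", "rb") as f:
        vocab = pickle.load(f)
    zero_rows = np.where(~mat.any(axis=1))[0]
    print([vocab[i] for i in zero_rows][:20])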

Ask for help

I'm sorry to disturb you. I'm a newcomer. I have tried my best to study the code, but I can't solve the following. There may be two errors in the scripts.

  1. When I ran the script below, it generated 5 files (sgns.contexts.txt, sgns.contexts.bin, sgns.words.txt, sgns.words.bin, sgns.words.words). What is the last file? Why is sgns.words.words blank? Can you offer the source code of word2vecf?

word2vecf/word2vecf -train w2.sub/pairs -pow 0.75 -cvocab w2.sub/counts.contexts.vocab -wvocab w2.sub/counts.words.vocab -dumpcv w2.sub/sgns.contexts -output w2.sub/sgns.words -threads 10 -negative 15 -size 500;

  2. The vecanalysis folder has no representations module, so the following import fails. If I want to get the aligned embeddings, how should I proceed? What parameters should I pass to the seq_procrustes.py script?

from vecanalysis.representations.representation_factory import create_representation

Thanks for your help. @williamleif

Different result found in the released vectors on Chinese corpus against the paper

Hi, I'm working on the Chinese corpus downloaded from Histwords.

I read the vectors for 病毒 ('virus') and 电脑 ('computer') and get the following cosine similarities:

('病毒', '电脑')
1950, cosine similarity=0.000
1960, cosine similarity=0.000
1970, cosine similarity=0.000
1980, cosine similarity=0.360
1990, cosine similarity=0.263

The Spearman correlation between [0, 0, 0, 0.36, 0.26] and [1950, 1960, 1970, 1980, 1990] is 0.78. However, the paper reports the correlation as 0.89 (at the end of Section 3.2).

Is there anything going wrong with my data processing? Thank you for your attention.
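For reference, the 0.78 quoted above is reproducible with scipy, which assigns average ranks to the three tied zeros:

    # Reproduce the Spearman correlation quoted in this issue.
    from scipy.stats import spearmanr

    sims = [0.0, 0.0, 0.0, 0.360, 0.263]
    decades = [1950, 1960, 1970, 1980, 1990]
    rho, pval = spearmanr(sims, decades)
    print(round(rho, 2))  # 0.78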
