williamleif / histwords

411 stars, 24 watchers, 92 forks, 1.02 MB

Collection of tools for building diachronic/historical word vectors

Home Page: http://nlp.stanford.edu/projects/histwords/

License: Apache License 2.0

Python 81.94% Shell 4.21% C 0.41% Makefile 0.53% JavaScript 9.76% HTML 0.51% CSS 2.64%

histwords's Introduction

Word Embeddings for Historical Text

Author: William Hamilton ([email protected])

Overview

An eclectic collection of tools for analyzing historical language change using vector space semantics.

[Figure: two-dimensional visualizations of historical semantic change (e.g., 'gay', 'broadcast', 'awful' across decades), from the project website]

Pre-trained historical embeddings

Various embeddings (for many languages and using different embedding approaches) are available on the project website.

Some pre-trained word2vec (i.e., SGNS) historical word vectors for multiple languages (constructed via Google N-grams) are also available here:

All except Chinese contain embeddings for the decades in the range 1800s-1990s (2000s are excluded because of sampling changes in the N-grams corpus). The Chinese data starts in 1950.

Embeddings constructed using the Corpus of Historical American English (COHA) are also available:

example.sh contains an example run, showing how to download and use the embeddings. example.py shows how to use the vector representations in Python (assuming you have already run the example.sh script).

This paper describes how the embeddings were constructed. If you make use of these embeddings in your research, please cite the following:

@inproceedings{hamilton_diachronic_2016,
  title = {Diachronic {Word} {Embeddings} {Reveal} {Statistical} {Laws} of {Semantic} {Change}},
  url = {http://arxiv.org/abs/1605.09096},
  booktitle = {Proc. {Assoc}. {Comput}. {Ling}. ({ACL})},
  author = {Hamilton, William L. and Leskovec, Jure and Jurafsky, Dan},
  year = {2016}
}

Training your own embeddings

You can use the provided code to train your own embeddings (see the code organization below). However, thanks to Ryan Heuser, you can also simply train embeddings with gensim (https://radimrehurek.com/gensim/) and use Ryan's port of my code to align the gensim models between time periods (https://gist.github.com/quadrismegistus/09a93e219a6ffc4f216fb85235535faf). Gensim contains many easy-to-use variants of word embeddings (e.g., LSI/SVD, word2vec, wordrank, ...), provides wrappers for other packages like GloVe, and is very well maintained, so this route is recommended.
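As a rough illustration of that gensim route, here is a minimal sketch that trains one SGNS model per decade. The corpus paths are hypothetical, the hyperparameters are illustrative, and the keyword names assume gensim 4.x (earlier versions use size instead of vector_size):

    # Sketch: train one word2vec (SGNS) model per decade with gensim.
    # Assumes hypothetical files corpus/1900.txt etc., one sentence per line.
    from gensim.models import Word2Vec
    from gensim.models.word2vec import LineSentence

    models = {}
    for decade in range(1900, 2000, 10):
        sentences = LineSentence("corpus/%d.txt" % decade)
        # sg=1 selects skip-gram with negative sampling (SGNS).
        models[decade] = Word2Vec(sentences, vector_size=300, window=4,
                                  min_count=100, sg=1, negative=5)
        models[decade].save("embeddings/%d.w2v" % decade)

The per-decade models can then be aligned pairwise with the alignment function from Ryan's gist linked above.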

Code organization

The structure of the code (in terms of folder organization) is as follows:

Main folder for using historical embeddings:

Folders with pre-processing code and active research code (potentially unstable):

example.py shows how to compute the similarity series for two words over time, which is how we evaluated different methods against the attested semantic shifts listed in our paper.
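As a library-free illustration of the same computation, here is a numpy sketch; it assumes the downloadable SGNS archives use a per-decade <year>-w.npy matrix plus a pickled <year>-vocab.pkl word list, which is an assumption you should verify against your download:

    # Sketch: cosine-similarity series for a word pair across decades.
    # ASSUMED layout: <dir>/<year>-w.npy and <dir>/<year>-vocab.pkl;
    # on Python 3, old pickles may need pickle.load(f, encoding="latin-1").
    import pickle
    import numpy as np

    def similarity_series(emb_dir, w1, w2, years):
        sims = {}
        for year in years:
            mat = np.load("%s/%d-w.npy" % (emb_dir, year))
            with open("%s/%d-vocab.pkl" % (emb_dir, year), "rb") as f:
                vocab = pickle.load(f)
            index = {w: i for i, w in enumerate(vocab)}
            v1, v2 = mat[index[w1]], mat[index[w2]]
            denom = np.linalg.norm(v1) * np.linalg.norm(v2)
            sims[year] = float(v1.dot(v2) / denom) if denom else 0.0
        return sims

    print(similarity_series("embeddings/eng-all_sgns", "gay", "cheerful",
                            range(1850, 2000, 10)))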

If you want to learn historical embeddings for new data, the code in the sgns directory is recommended and can be run with the default settings. As long as your corpus has at least 100 million words per time period, this is the best method. For smaller corpora, use the representations/ppmigen.py code followed by vecanalysis/makelowdim.py (to learn SVD embeddings). In either case, the vecanalysis/seq_procrustes.py code should be used to align the learned embeddings. The default hyperparameters should suffice for most use cases.
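The alignment step is orthogonal Procrustes: find the rotation that maps one decade's embedding matrix onto the next while preserving all pairwise cosine similarities. Here is a minimal numpy sketch of the core computation only, not the seq_procrustes.py interface, which also handles vocabulary intersection and I/O:

    # Sketch: orthogonal Procrustes alignment of two embedding matrices.
    # `base` and `other` must already be row-aligned on a shared vocabulary.
    import numpy as np

    def procrustes_align(base, other):
        # SVD of the cross-covariance gives the optimal rotation R = U V^T,
        # minimizing ||other @ R - base||_F over orthogonal matrices R.
        u, _, vt = np.linalg.svd(other.T.dot(base))
        return other.dot(u.dot(vt))

Because R is orthogonal, the aligned vectors keep their lengths and mutual angles; only the coordinate system changes.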

However, as a caveat to the above, the code is somewhat messy, unstable, and specific to the historical corpora that it was originally designed for. If you are looking for a nice, off-the-shelf toolbox to run word2vec, I recommend you check out gensim.

Dependencies

Core dependencies:

You will also need Jupyter/IPython to run any IPython notebooks.

histwords's People

Contributors

williamleif


histwords's Issues

code for visualizing results

very cool work!

  1. I have been looking for the code that uses t-SNE to create the visualizations on the front page of the git repo, but can't find it. I've grepped for usage of sklearn (and for the 'manifold' / 'tsne' keywords) but only found sklearn's normalization in use. Is the visualization code in the repo?

  2. The visualization on the front page shows 'broadcast 1900s' twice (in the middle panel). Is that intentional?
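For reference, the usual recipe for plots like these is to collect a target word's vectors from several decades (plus nearest neighbors) and project them to 2-D with scikit-learn's t-SNE. A minimal sketch with placeholder data, since the original plotting code does not appear to be in the repo:

    # Sketch: 2-D t-SNE projection of one word's vectors across decades.
    # Random vectors stand in for real embeddings to keep this self-contained.
    import numpy as np
    from sklearn.manifold import TSNE

    rng = np.random.RandomState(0)
    vectors = {"broadcast_%d" % y: rng.randn(300) for y in range(1850, 2000, 10)}

    labels = sorted(vectors)
    matrix = np.vstack([vectors[l] for l in labels])
    coords = TSNE(n_components=2, perplexity=5, random_state=0).fit_transform(matrix)
    for label, (x, y) in zip(labels, coords):
        print("%-16s %8.2f %8.2f" % (label, x, y))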

Problem installing, seemingly due to (missing?) folder "cooccurrence"

bash-4.3$ pip install --user git+https://github.com/williamleif/histwords.git
You are using pip version 6.0.8, however version 8.1.2 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.
Collecting git+https://github.com/williamleif/histwords.git
  Cloning https://github.com/williamleif/histwords.git to /tmp/pip-GQOH9S-build
Traceback (most recent call last):
  File "<string>", line 20, in <module>
  File "/tmp/pip-GQOH9S-build/setup.py", line 7, in <module>
    ext_modules = cythonize(["googlengram/pullscripts/*.pyx", "cooccurrence/*.pyx"]),
  File "/usr/lib64/python2.7/site-packages/Cython/Build/Dependencies.py", line 754, in cythonize
    aliases=aliases)
  File "/usr/lib64/python2.7/site-packages/Cython/Build/Dependencies.py", line 649, in create_extension_list
    for file in nonempty(extended_iglob(filepattern), "'%s' doesn't match any files" % filepattern):
  File "/usr/lib64/python2.7/site-packages/Cython/Build/Dependencies.py", line 103, in nonempty
    raise ValueError(error_msg)
ValueError: 'cooccurrence/*.pyx' doesn't match any files
----------------------------------------
Command "python setup.py egg_info" failed with error code 1 in /tmp/pip-GQOH9S-build
bash-4.3$

wrong setup information in "setup.py"

Existing "setup.py" file could not enable successful installation when using "pip install", for the value "sklearn" in "install_requires" is no longer available for installation. The correct value should be "scikit-learn". Installation successes after this change.

Could you provide embeddings that are not normalised?

I noticed that the pre-trained embeddings available for download are all normalised, with each vector's L2 norm equal to 1. Since normalisation causes information loss, could you provide the original (unnormalised) embeddings? Thanks.

Difficulties to use seq_procrustes.py with new embeddings

Hi,

I am currently experiencing some difficulties generating new embeddings with your code for visualizing words over time.

For now, I have generated separate embeddings by year using sgns/hyperwords. That seems to be OK.

I am now trying to use your script vecanalysis/seq_procrustes.py, but I think I am not using the correct format for the required count file: I suppose it is not the same as the one generated by hyperwords? Maybe I missed it, but is there an example of this file somewhere?

I downloaded the example embeddings/eng-fiction-all_sgns for visualisation (and it works), but could not find any count file.

Thank you for the answer.

Best regards.

Array multiplication or matrix multiplication?

On lines 108 and 110 of histwords/representations/embedding.py, it seems that you use elementwise array multiplication instead of matrix multiplication. Is this intended? Usually s and u are combined via matrix multiplication, I think.
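For context, if s is the 1-D array of singular values, numpy broadcasting makes the elementwise product u * s scale column j of u by s[j], which is exactly the matrix product u @ diag(s), so the two forms coincide. A quick check:

    # Elementwise u * s (broadcasting) vs. matrix product u @ diag(s):
    # for a 1-D s, both scale the columns of u identically.
    import numpy as np

    rng = np.random.RandomState(0)
    u = rng.randn(5, 3)   # e.g. left singular vectors (5 x 3)
    s = rng.rand(3)       # singular values as a 1-D array
    print(np.allclose(u * s, u.dot(np.diag(s))))  # True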

SGNS results

Hi,

I have a problem re-generating the SGNS embeddings on the Google N-grams corpus.

I follow these steps:

  1. use histwords/googlengram/pullscripts/posgrab.py to generate counts for 1-gram
  2. use histwords/googlengram/pullscripts/downloadandsplit.py then histwords/googlengram/pullscripts/gramgrab.py (set context to 4)
  3. use histwords/googlengram/pullscripts/runmerge.py on the output from 2 and then histwords/googlengram/pullscripts/indexmerge.py
  4. use histwords/googlengram/freqperyear.py on the output of 3
  5. use histwords/googlengram/makedecades.py on the output of 3
  6. use histwords/sgns/makecorpus.py by passing the output of 1, 4 and 5
  7. train embeddings using histwords/sgns/runword2vec.py (using --sequential option)
  8. use histwords/sgns/postprocessingsgns.py on the trained data.

My problem is that the generated vectors are not the same as the pre-trained vectors at http://snap.stanford.edu/historical_embeddings/eng-all_sgns.zip: my vocabulary size is about 50,000 while yours is about 100,000.

So my question is: is anything wrong in the steps I followed? Or can you give me any information on why this happens?

Thanks,
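One quick way to check a discrepancy like this is to count the vocabulary in each embedding set directly, assuming the pickled <year>-vocab.pkl word lists mentioned in the sketches above; the paths here are hypothetical:

    # Sketch: compare vocabulary sizes between two embedding directories.
    # ASSUMES pickled <year>-vocab.pkl word lists; paths are hypothetical.
    import pickle

    for path in ["eng-all_sgns/1990-vocab.pkl", "my-output/1990-vocab.pkl"]:
        with open(path, "rb") as f:
            print(path, len(pickle.load(f)))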

Zero-valued vectors?

Regarding the pre-trained vectors for some of the corpora: (on the HistWords website)

For specific decades, there appear to be a handful of word vectors that are 0.0 across all 300 dimensions, even though the corresponding words are still present in the corpus for that decade.

These words do not get any sort of representation and have been assigned zero values throughout. For example, the vector for the word 'autism' from the 1800s decade of the Google N-grams eng-all vectors is [0.0 ... 0.0] across all 300 dimensions.

Would it be apt to treat these words as simply 'missing' from the corpus in that decade?
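Such rows are straightforward to detect, so they can be filtered out or treated as missing explicitly; a minimal numpy sketch, again assuming the <year>-w.npy / <year>-vocab.pkl layout:

    # Sketch: list words whose vector is all zeros in a given decade.
    import pickle
    import numpy as np

    mat = np.load("eng-all_sgns/1800-w.npy")
    with open("eng-all_sgns/1800-vocab.pkl", "rb") as f:
        vocab = pickle.load(f)
    zero_rows = np.where(~mat.any(axis=1))[0]
    print([vocab[i] for i in zero_rows][:20])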

Ask for help

I'm sorry to disturb you. I'm a newcomer. I have tried my best to study the code, but I can't solve the following. There may be two errors in the scripts.

  1. When I ran the script below, it generated 5 files (sgns.contexts.txt, sgns.contexts.bin, sgns.words.txt, sgns.words.bin, sgns.words.words). What is the last file? Why is sgns.words.words blank? Can you offer the source code of word2vecf?

word2vecf/word2vecf -train w2.sub/pairs -pow 0.75 -cvocab w2.sub/counts.contexts.vocab -wvocab w2.sub/counts.words.vocab -dumpcv w2.sub/sgns.contexts -output w2.sub/sgns.words -threads 10 -negative 15 -size 500;

  2. The vecanalysis folder has no representations module, so the following import fails. If I want to get the aligned embeddings, how should I proceed? What parameters should I pass to the seq_procrustes.py script?

from vecanalysis.representations.representation_factory import create_representation

Thanks for your help. @williamleif

Different result found in the released vectors on Chinese corpus against the paper

Hi, I'm working on the Chinese corpus downloaded from Histwords.

I read the vectors for 病毒 ('virus') and 电脑 ('computer') and get the following cosine similarities:

('病毒', '电脑')
1950, cosine similarity=0.000
1960, cosine similarity=0.000
1970, cosine similarity=0.000
1980, cosine similarity=0.360
1990, cosine similarity=0.263

The Spearman correlation between [0, 0, 0, 0.36, 0.26] and [1950, 1960, 1970, 1980, 1990] is 0.78. However, the paper reports the correlation as 0.89 (at the end of Section 3.2).

Is there anything going wrong with my data processing? Thank you for your attention.
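For reference, the 0.78 quoted above is reproducible with scipy, which assigns average ranks to the three tied zeros:

    # Reproduce the Spearman correlation quoted in this issue.
    from scipy.stats import spearmanr

    sims = [0.0, 0.0, 0.0, 0.360, 0.263]
    decades = [1950, 1960, 1970, 1980, 1990]
    rho, pval = spearmanr(sims, decades)
    print(round(rho, 2))  # 0.78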
