explosion / sense2vec

Home Page: https://explosion.ai/blog/sense2vec-reloaded
License: MIT

sense2vec: Contextually-keyed word vectors

sense2vec (Trask et al., 2015) is a nice twist on word2vec that lets you learn more interesting and detailed word vectors. This library is a simple Python implementation for loading, querying and training sense2vec models. For more details, check out our blog post. To explore the semantic similarities across all Reddit comments of 2015 and 2019, see the interactive demo.

🦆 Version 2.0 (for spaCy v3) out now! Read the release notes here.


✨ Features

  • Query vectors for multi-word phrases based on part-of-speech tags and entity labels.
  • spaCy pipeline component and extension attributes.
  • Fully serializable so you can easily ship your sense2vec vectors with your spaCy model packages.
  • Optional caching of nearest neighbors for super fast "most similar" queries.
  • Train your own vectors using a pretrained spaCy model, raw text and GloVe or Word2Vec via fastText (details).
  • Prodigy annotation recipes for evaluating models, creating lists of similar multi-word phrases and converting them to match patterns, e.g. for rule-based NER or to bootstrap NER annotation (details & examples).

🚀 Quickstart

Standalone usage

from sense2vec import Sense2Vec

s2v = Sense2Vec().from_disk("/path/to/s2v_reddit_2015_md")
query = "natural_language_processing|NOUN"
assert query in s2v
vector = s2v[query]
freq = s2v.get_freq(query)
most_similar = s2v.most_similar(query, n=3)
# [('machine_learning|NOUN', 0.8986967),
#  ('computer_vision|NOUN', 0.8636297),
#  ('deep_learning|NOUN', 0.8573361)]

Usage as a spaCy pipeline component

⚠️ Note that this example describes usage with spaCy v3. For usage with spaCy v2, download sense2vec==1.0.3 and check out the v1.x branch of this repo.

import spacy

nlp = spacy.load("en_core_web_sm")
s2v = nlp.add_pipe("sense2vec")
s2v.from_disk("/path/to/s2v_reddit_2015_md")

doc = nlp("A sentence about natural language processing.")
assert doc[3:6].text == "natural language processing"
freq = doc[3:6]._.s2v_freq
vector = doc[3:6]._.s2v_vec
most_similar = doc[3:6]._.s2v_most_similar(3)
# [(('machine learning', 'NOUN'), 0.8986967),
#  (('computer vision', 'NOUN'), 0.8636297),
#  (('deep learning', 'NOUN'), 0.8573361)]

Interactive demos

To try out our pretrained vectors trained on Reddit comments, check out the interactive sense2vec demo.

This repo also includes a Streamlit demo script for exploring vectors and the most similar phrases. After installing streamlit, you can run the script with streamlit run and one or more paths to pretrained vectors as positional arguments on the command line. For example:

pip install streamlit
streamlit run https://raw.githubusercontent.com/explosion/sense2vec/master/scripts/streamlit_sense2vec.py /path/to/vectors

Pretrained vectors

To use the vectors, download the archive(s) and pass the extracted directory to Sense2Vec.from_disk or Sense2VecComponent.from_disk. The vector files are attached to the GitHub release. Large files have been split into multi-part downloads.

Vectors Size Description 📥 Download (zipped)
s2v_reddit_2019_lg 4 GB Reddit comments 2019 (01-07) part 1, part 2, part 3
s2v_reddit_2015_md 573 MB Reddit comments 2015 part 1

To merge the multi-part archives, you can run the following:

cat s2v_reddit_2019_lg.tar.gz.* > s2v_reddit_2019_lg.tar.gz
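
To extract the merged archive, standard tar usage applies (the output directory name may differ):

tar -xzf s2v_reddit_2019_lg.tar.gz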

⏳ Installation & Setup

sense2vec releases are available on pip:

pip install sense2vec

To use pretrained vectors, download one of the vector packages, unpack the .tar.gz archive and point from_disk to the extracted data directory:

from sense2vec import Sense2Vec
s2v = Sense2Vec().from_disk("/path/to/s2v_reddit_2015_md")

πŸ‘©β€πŸ’» Usage

Usage with spaCy v3

The easiest way to use the library and vectors is to plug it into your spaCy pipeline. The sense2vec package exposes a Sense2VecComponent, which can be initialised with the shared vocab and added to your spaCy pipeline as a custom pipeline component. By default, components are added to the end of the pipeline, which is the recommended position for this component, since it needs access to the dependency parse and, if available, named entities.

import spacy
from sense2vec import Sense2VecComponent

nlp = spacy.load("en_core_web_sm")
s2v = nlp.add_pipe("sense2vec")
s2v.from_disk("/path/to/s2v_reddit_2015_md")

The component will add several extension attributes and methods to spaCy's Token and Span objects that let you retrieve vectors and frequencies, as well as most similar terms.

doc = nlp("A sentence about natural language processing.")
assert doc[3:6].text == "natural language processing"
freq = doc[3:6]._.s2v_freq
vector = doc[3:6]._.s2v_vec
most_similar = doc[3:6]._.s2v_most_similar(3)

For entities, the entity labels are used as the "sense" (instead of the token's part-of-speech tag):

doc = nlp("A sentence about Facebook and Google.")
for ent in doc.ents:
    assert ent._.in_s2v
    most_similar = ent._.s2v_most_similar(3)

Available attributes

The following extension attributes are exposed on the Doc object via the ._ property:

Name Attribute Type Return Type Description
s2v_phrases property list All sense2vec-compatible phrases in the given Doc (noun phrases, named entities).

The following attributes are available via the ._ property of Token and Span objects – for example token._.in_s2v:

Name Attribute Type Return Type Description
in_s2v property bool Whether a key exists in the vector map.
s2v_key property unicode The sense2vec key of the given object, e.g. "duck|NOUN".
s2v_vec property ndarray[float32] The vector of the given key.
s2v_freq property int The frequency of the given key.
s2v_other_senses property list Available other senses, e.g. "duck|VERB" for "duck|NOUN".
s2v_most_similar method list Get the n most similar terms. Returns a list of ((word, sense), score) tuples.
s2v_similarity method float Get the similarity to another Token or Span.

⚠️ A note on span attributes: Under the hood, entities in doc.ents are Span objects. This is why the pipeline component also adds attributes and methods to spans and not just tokens. However, it's not recommended to use the sense2vec attributes on arbitrary slices of the document, since the model likely won't have a key for the respective text. Span objects also don't have a part-of-speech tag, so if no entity label is present, the "sense" defaults to the root's part-of-speech tag.
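
For example, two recognized phrases can be compared directly via the s2v_similarity method (a minimal sketch; the token indices assume the sentence below and loaded Reddit vectors):

doc = nlp("A sentence about machine learning and computer vision.")
span_a = doc[3:5]  # "machine learning"
span_b = doc[6:8]  # "computer vision"
score = span_a._.s2v_similarity(span_b)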

Adding sense2vec to a trained pipeline

If you're training and packaging a spaCy pipeline and want to include a sense2vec component in it, you can load in the data via the [initialize] block of the training config:

[initialize.components]

[initialize.components.sense2vec]
data_path = "/path/to/s2v_reddit_2015_md"

Standalone usage

You can also use the underlying Sense2Vec class directly and load in the vectors using the from_disk method. See below for the available API methods.

from sense2vec import Sense2Vec
s2v = Sense2Vec().from_disk("/path/to/reddit_vectors-1.1.0")
most_similar = s2v.most_similar("natural_language_processing|NOUN", n=10)

⚠️ Important note: To look up entries in the vectors table, the keys need to follow the scheme of phrase_text|SENSE (note the _ instead of spaces and the | before the tag or label) – for example, machine_learning|NOUN. Also note that the underlying vector table is case-sensitive.
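
A minimal illustration of the key scheme (assuming the vectors loaded above contain the entry):

word, sense = "machine_learning", "NOUN"
key = f"{word}|{sense}"  # underscores instead of spaces, pipe before the tag
assert key in s2v
vector = s2v[key]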

🎛 API

class Sense2Vec

The standalone Sense2Vec object that holds the vectors, strings and frequencies.

method Sense2Vec.__init__

Initialize the Sense2Vec object.

Argument Type Description
shape tuple The vector shape. Defaults to (1000, 128).
strings spacy.strings.StringStore Optional string store. Will be created if it doesn't exist.
senses list Optional list of all available senses. Used in methods that generate the best sense or other senses.
vectors_name unicode Optional name to assign to the Vectors table, to prevent clashes. Defaults to "sense2vec".
overrides dict Optional custom functions to use, mapped to names registered via the registry, e.g. {"make_key": "custom_make_key"}.
RETURNS Sense2Vec The newly constructed object.
s2v = Sense2Vec(shape=(300, 128), senses=["VERB", "NOUN"])

method Sense2Vec.__len__

The number of rows in the vectors table.

Argument Type Description
RETURNS int The number of rows in the vectors table.
s2v = Sense2Vec(shape=(300, 128))
assert len(s2v) == 300

method Sense2Vec.__contains__

Check if a key is in the vectors table.

Argument Type Description
key unicode / int The key to look up.
RETURNS bool Whether the key is in the table.
s2v = Sense2Vec(shape=(10, 4))
s2v.add("avocado|NOUN", numpy.asarray([4, 2, 2, 2], dtype=numpy.float32))
assert "avocado|NOUN" in s2v
assert "avocado|VERB" not in s2v

method Sense2Vec.__getitem__

Retrieve a vector for a given key. Returns None if the key is not in the table.

Argument Type Description
key unicode / int The key to look up.
RETURNS numpy.ndarray The vector or None.
vec = s2v["avocado|NOUN"]

method Sense2Vec.__setitem__

Set a vector for a given key. Will raise an error if the key doesn't exist. To add a new entry, use Sense2Vec.add.

Argument Type Description
key unicode / int The key.
vector numpy.ndarray The vector to set.
vec = s2v["avocado|NOUN"]
s2v["avacado|NOUN"] = vec

method Sense2Vec.add

Add a new vector to the table.

Argument Type Description
key unicode / int The key to add.
vector numpy.ndarray The vector to add.
freq int Optional frequency count. Used to find best matching senses.
vec = s2v["avocado|NOUN"]
s2v.add("πŸ₯‘|NOUN", vec, 1234)

method Sense2Vec.get_freq

Get the frequency count for a given key.

Argument Type Description
key unicode / int The key to look up.
default - Default value to return if no frequency is found.
RETURNS int The frequency count.
vec = s2v["avocado|NOUN"]
s2v.add("πŸ₯‘|NOUN", vec, 1234)
assert s2v.get_freq("πŸ₯‘|NOUN") == 1234

method Sense2Vec.set_freq

Set a frequency count for a given key.

Argument Type Description
key unicode / int The key to set the count for.
freq int The frequency count.
s2v.set_freq("avocado|NOUN", 104294)

method Sense2Vec.__iter__, Sense2Vec.items

Iterate over the entries in the vectors table.

Argument Type Description
YIELDS tuple String key and vector pairs in the table.
for key, vec in s2v:
    print(key, vec)

for key, vec in s2v.items():
    print(key, vec)

method Sense2Vec.keys

Iterate over the keys in the table.

Argument Type Description
YIELDS unicode The string keys in the table.
all_keys = list(s2v.keys())

method Sense2Vec.values

Iterate over the vectors in the table.

Argument Type Description
YIELDS numpy.ndarray The vectors in the table.
all_vecs = list(s2v.values())

property Sense2Vec.senses

The available senses in the table, e.g. "NOUN" or "VERB" (added at initialization).

Argument Type Description
RETURNS list The available senses.
s2v = Sense2Vec(senses=["VERB", "NOUN"])
assert "VERB" in s2v.senses

property Sense2Vec.frequencies

The frequencies of the keys in the table, in descending order.

Argument Type Description
RETURNS list The (key, freq) tuples by frequency, descending.
most_frequent = s2v.frequencies[:10]
key, freq = s2v.frequencies[0]

method Sense2Vec.similarity

Make a semantic similarity estimate of two keys or two sets of keys. The default estimate is cosine similarity using an average of vectors.

Argument Type Description
keys_a unicode / int / iterable The string or integer key(s).
keys_b unicode / int / iterable The other string or integer key(s).
RETURNS float The similarity score.
keys_a = ["machine_learning|NOUN", "natural_language_processing|NOUN"]
keys_b = ["computer_vision|NOUN", "object_detection|NOUN"]
print(s2v.similarity(keys_a, keys_b))
assert s2v.similarity("machine_learning|NOUN", "machine_learning|NOUN") == 1.0

method Sense2Vec.most_similar

Get the most similar entries in the table. If more than one key is provided, the average of the vectors is used. To make this method faster, see the script for precomputing a cache of the nearest neighbors.

Argument Type Description
keys unicode / int / iterable The string or integer key(s) to compare to.
n int The number of similar keys to return. Defaults to 10.
batch_size int The batch size to use. Defaults to 16.
RETURNS list The (key, score) tuples of the most similar vectors.
most_similar = s2v.most_similar("natural_language_processing|NOUN", n=3)
# [('machine_learning|NOUN', 0.8986967),
#  ('computer_vision|NOUN', 0.8636297),
#  ('deep_learning|NOUN', 0.8573361)]

method Sense2Vec.get_other_senses

Find other entries for the same word with a different sense, e.g. "duck|VERB" for "duck|NOUN".

Argument Type Description
key unicode / int The key to check.
ignore_case bool Check for uppercase, lowercase and titlecase. Defaults to True.
RETURNS list The string keys of other entries with different senses.
other_senses = s2v.get_other_senses("duck|NOUN")
# ['duck|VERB', 'Duck|ORG', 'Duck|VERB', 'Duck|PERSON', 'Duck|ADJ']

method Sense2Vec.get_best_sense

Find the best-matching sense for a given word based on the available senses and frequency counts. Returns None if no match is found.

Argument Type Description
word unicode The word to check.
senses list Optional list of senses to limit the search to. If not set / empty, all senses in the vectors are used.
ignore_case bool Check for uppercase, lowercase and titlecase. Defaults to True.
RETURNS unicode The best-matching key or None.
assert s2v.get_best_sense("duck") == "duck|NOUN"
assert s2v.get_best_sense("duck", ["VERB", "ADJ"]) == "duck|VERB"

method Sense2Vec.to_bytes

Serialize a Sense2Vec object to a bytestring.

Argument Type Description
exclude list Names of serialization fields to exclude.
RETURNS bytes The serialized Sense2Vec object.
s2v_bytes = s2v.to_bytes()

method Sense2Vec.from_bytes

Load a Sense2Vec object from a bytestring.

Argument Type Description
bytes_data bytes The data to load.
exclude list Names of serialization fields to exclude.
RETURNS Sense2Vec The loaded object.
s2v_bytes = s2v.to_bytes()
new_s2v = Sense2Vec().from_bytes(s2v_bytes)

method Sense2Vec.to_disk

Serialize a Sense2Vec object to a directory.

Argument Type Description
path unicode / Path The path.
exclude list Names of serialization fields to exclude.
s2v.to_disk("/path/to/sense2vec")

method Sense2Vec.from_disk

Load a Sense2Vec object from a directory.

Argument Type Description
path unicode / Path The path to load from.
exclude list Names of serialization fields to exclude.
RETURNS Sense2Vec The loaded object.
s2v.to_disk("/path/to/sense2vec")
new_s2v = Sense2Vec().from_disk("/path/to/sense2vec")

class Sense2VecComponent

The pipeline component to add sense2vec to spaCy pipelines.

method Sense2VecComponent.__init__

Initialize the pipeline component.

Argument Type Description
vocab Vocab The shared Vocab. Mostly used for the shared StringStore.
shape tuple The vector shape.
merge_phrases bool Whether to merge sense2vec phrases into one token. Defaults to False.
lemmatize bool Always look up lemmas if available in the vectors, otherwise default to original word. Defaults to False.
overrides dict Optional custom functions to use, mapped to names registered via the registry, e.g. {"make_key": "custom_make_key"}.
RETURNS Sense2VecComponent The newly constructed object.
s2v = Sense2VecComponent(nlp.vocab)

classmethod Sense2VecComponent.from_nlp

Initialize the component from an nlp object. Mostly used as the component factory for the entry point (see setup.cfg) and to auto-register via the @spacy.component decorator.

Argument Type Description
nlp Language The nlp object.
**cfg - Optional config parameters.
RETURNS Sense2VecComponent The newly constructed object.
s2v = Sense2VecComponent.from_nlp(nlp)

method Sense2VecComponent.__call__

Process a Doc object with the component. Typically only called as part of the spaCy pipeline and not directly.

Argument Type Description
doc Doc The document to process.
RETURNS Doc The processed document.

method Sense2VecComponent.init_component

Register the component-specific extension attributes. This happens only if the component is actually added to the pipeline and used – otherwise, tokens would get the attributes even if the component is only created and never added.

method Sense2VecComponent.to_bytes

Serialize the component to a bytestring. Also called when the component is added to the pipeline and you run nlp.to_bytes.

Argument Type Description
RETURNS bytes The serialized component.
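
For example (a sketch, assuming s2v is the component added via nlp.add_pipe("sense2vec")):

component_bytes = s2v.to_bytes()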

method Sense2VecComponent.from_bytes

Load a component from a bytestring. Also called when you run nlp.from_bytes.

Argument Type Description
bytes_data bytes The data to load.
RETURNS Sense2VecComponent The loaded object.
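
For example (a sketch restoring the bytestring created above):

s2v = Sense2VecComponent(nlp.vocab).from_bytes(component_bytes)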

method Sense2VecComponent.to_disk

Serialize the component to a directory. Also called when the component is added to the pipeline and you run nlp.to_disk.

Argument Type Description
path unicode / Path The path.
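
For example (the path is a placeholder):

s2v.to_disk("/path/to/sense2vec_component")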

method Sense2VecComponent.from_disk

Load the component from a directory. Also called when you run nlp.from_disk.

Argument Type Description
path unicode / Path The path to load from.
RETURNS Sense2VecComponent The loaded object.
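
For example (loading the component data saved above):

s2v = nlp.add_pipe("sense2vec")
s2v.from_disk("/path/to/sense2vec_component")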

class registry

Function registry (powered by catalogue) to easily customize the functions used to generate keys and phrases. Allows you to decorate and name custom functions, swap them out and serialize the custom names when you save out the model. The following registry options are available:

Name Description
registry.make_key Given a word and sense, return a string of the key, e.g. "word|sense".
registry.split_key Given a string key, return a (word, sense) tuple.
registry.make_spacy_key Given a spaCy object (Token or Span) and a boolean prefer_ents keyword argument (whether to prefer the entity label for single tokens), return a (word, sense) tuple. Used in extension attributes to generate a key for tokens and spans.
registry.get_phrases Given a spaCy Doc, return a list of Span objects used for sense2vec phrases (typically noun phrases and named entities).
registry.merge_phrases Given a spaCy Doc, get all sense2vec phrases and merge them into single tokens.

Each registry has a register method that can be used as a function decorator and takes one argument, the name of the custom function.

from sense2vec import registry

@registry.make_key.register("custom")
def custom_make_key(word, sense):
    return f"{word}###{sense}"

@registry.split_key.register("custom")
def custom_split_key(key):
    word, sense = key.split("###")
    return word, sense

When initializing the Sense2Vec object, you can now pass in a dictionary of overrides with the names of your custom registered functions.

overrides = {"make_key": "custom", "split_key": "custom"}
s2v = Sense2Vec(overrides=overrides)

This makes it easy to experiment with different strategies and to serialize the strategies as plain strings (instead of having to pass around and/or pickle the functions themselves).
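
The same override names can also be passed to the pipeline component via its overrides argument (a sketch, assuming the custom functions above are registered):

s2v = Sense2VecComponent(nlp.vocab, overrides={"make_key": "custom", "split_key": "custom"})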

🚂 Training your own sense2vec vectors

The /scripts directory contains command line utilities for preprocessing text and training your own vectors.

Requirements

To train your own sense2vec vectors, you'll need the following:

  • A very large source of raw text (ideally more than you'd use for word2vec, since the senses make the vocabulary more sparse). We recommend at least 1 billion words.
  • A pretrained spaCy model that assigns part-of-speech tags, dependencies and named entities, and populates the doc.noun_chunks. If the language you need doesn't provide a built in syntax iterator for noun phrases, you'll need to write your own. (The doc.noun_chunks and doc.ents are what sense2vec uses to determine what's a phrase.)
  • GloVe or fastText installed and built. You should be able to clone the repo and run make in the respective directory.

Step-by-step process

The training process is split up into several steps to allow you to resume at any given point. Processing scripts are designed to operate on single files, making it easy to parallelize the work. The scripts in this repo require either GloVe or fastText, which you need to clone and make.

For fastText, the scripts will require the path to the built binary file. If you're working on Windows, you can build with cmake, or alternatively use the .exe file from this unofficial repo with fastText binary builds for Windows: https://github.com/xiamx/fastText/releases.

Script Description
1. 01_parse.py Use spaCy to parse the raw text and output binary collections of Doc objects (see DocBin).
2. 02_preprocess.py Load a collection of parsed Doc objects produced in the previous step and output text files in the sense2vec format (one sentence per line and merged phrases with senses).
3. 03_glove_build_counts.py Use GloVe to build the vocabulary and counts. Skip this step if you're using Word2Vec via fastText.
4. 04_glove_train_vectors.py / 04_fasttext_train_vectors.py Use GloVe or fastText to train vectors.
5. 05_export.py Load the vectors and frequencies and output a sense2vec component that can be loaded via Sense2Vec.from_disk.
6. 06_precompute_cache.py Optional: Precompute nearest-neighbor queries for every entry in the vocab to make Sense2Vec.most_similar faster.

For more detailed documentation of the scripts, check out the source or run them with --help. For example, python scripts/01_parse.py --help.
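
A hedged sketch of the full sequence (the positional arguments are illustrative assumptions, not the scripts' real interfaces – consult --help for each script):

python scripts/01_parse.py raw_text.txt parsed/ en_core_web_sm
python scripts/02_preprocess.py parsed/ corpus/
python scripts/03_glove_build_counts.py /path/to/glove corpus/ counts/
python scripts/04_glove_train_vectors.py /path/to/glove counts/ vectors/
python scripts/05_export.py vectors/ counts/ s2v_output/
python scripts/06_precompute_cache.py s2v_output/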

🍳 Prodigy recipes

This package also seamlessly integrates with the Prodigy annotation tool and exposes recipes for using sense2vec vectors to quickly generate lists of multi-word phrases and bootstrap NER annotations. To use a recipe, sense2vec needs to be installed in the same environment as Prodigy. For an example of a real-world use case, check out this NER project with downloadable datasets.

The following recipes are available – see below for more detailed docs.

Recipe Description
sense2vec.teach Bootstrap a terminology list using sense2vec.
sense2vec.to-patterns Convert phrases dataset to token-based match patterns.
sense2vec.eval Evaluate a sense2vec model by asking about phrase triples.
sense2vec.eval-most-similar Evaluate a sense2vec model by correcting the most similar entries.
sense2vec.eval-ab Perform an A/B evaluation of two pretrained sense2vec vector models.

recipe sense2vec.teach

Bootstrap a terminology list using sense2vec. Prodigy will suggest similar terms based on the most similar phrases from sense2vec, and the suggestions will be adjusted as you annotate and accept similar phrases. For each seed term, the best matching sense according to the sense2vec vectors will be used.

prodigy sense2vec.teach [dataset] [vectors_path] [--seeds] [--threshold]
[--n-similar] [--batch-size] [--resume]
Argument Type Description
dataset positional Dataset to save annotations to.
vectors_path positional Path to pretrained sense2vec vectors.
--seeds, -s option One or more comma-separated seed phrases.
--threshold, -t option Similarity threshold. Defaults to 0.85.
--n-similar, -n option Number of similar items to get at once.
--batch-size, -b option Batch size for submitting annotations.
--resume, -R flag Resume from an existing phrases dataset.

Example

prodigy sense2vec.teach tech_phrases /path/to/s2v_reddit_2015_md
--seeds "natural language processing, machine learning, artificial intelligence"

recipe sense2vec.to-patterns

Convert a dataset of phrases collected with sense2vec.teach to token-based match patterns that can be used with spaCy's EntityRuler or recipes like ner.match. If no output file is specified, the patterns are written to stdout. The examples are tokenized so that multi-token terms are represented correctly, e.g.: {"label": "SHOE_BRAND", "pattern": [{ "LOWER": "new" }, { "LOWER": "balance" }]}.

prodigy sense2vec.to-patterns [dataset] [spacy_model] [label] [--output-file]
[--case-sensitive] [--dry]
Argument Type Description
dataset positional Phrase dataset to convert.
spacy_model positional spaCy model for tokenization.
label positional Label to apply to all patterns.
--output-file, -o option Optional output file. Defaults to stdout.
--case-sensitive, -CS flag Make patterns case-sensitive.
--dry, -D flag Perform a dry run and don't output anything.

Example

prodigy sense2vec.to-patterns tech_phrases en_core_web_sm TECHNOLOGY
--output-file /path/to/patterns.jsonl

recipe sense2vec.eval

Evaluate a sense2vec model by asking about phrase triples: is word A more similar to word B, or to word C? If the human mostly agrees with the model, the vectors model is good. The recipe will only ask about vectors with the same sense and supports different example selection strategies.

prodigy sense2vec.eval [dataset] [vectors_path] [--strategy] [--senses]
[--exclude-senses] [--n-freq] [--threshold] [--batch-size] [--eval-whole]
[--eval-only] [--show-scores]
Argument Type Description
dataset positional Dataset to save annotations to.
vectors_path positional Path to pretrained sense2vec vectors.
--strategy, -st option Example selection strategy: most_similar (default), most_least_similar or random.
--senses, -s option Comma-separated list of senses to limit the selection to. If not set, all senses in the vectors will be used.
--exclude-senses, -es option Comma-separated list of senses to exclude. See prodigy_recipes.EVAL_EXCLUDE_SENSES for the defaults.
--n-freq, -f option Number of most frequent entries to limit to.
--threshold, -t option Minimum similarity threshold to consider examples.
--batch-size, -b option Batch size to use.
--eval-whole, -E flag Evaluate the whole dataset instead of the current session.
--eval-only, -O flag Don't annotate, only evaluate the current dataset.
--show-scores, -S flag Show all scores for debugging.

Strategies

Name Description
most_similar Pick a random word from a random sense and get its most similar entries of the same sense. Ask about the similarity to the last and middle entry from that selection.
most_least_similar Pick a random word from a random sense and get the least similar entry from its most similar entries, and then the last most similar entry of that.
random Pick a random sample of 3 words from the same random sense.

Example

prodigy sense2vec.eval vectors_eval /path/to/s2v_reddit_2015_md
--senses NOUN,ORG,PRODUCT --threshold 0.5

UI preview of sense2vec.eval

recipe sense2vec.eval-most-similar

Evaluate a vectors model by looking at the most similar entries it returns for a random phrase and unselecting the mistakes.

prodigy sense2vec.eval-most-similar [dataset] [vectors_path] [--senses] [--exclude-senses]
[--n-freq] [--n-similar] [--batch-size] [--eval-whole] [--eval-only]
[--show-scores]
Argument Type Description
dataset positional Dataset to save annotations to.
vectors_path positional Path to pretrained sense2vec vectors.
--senses, -s option Comma-separated list of senses to limit the selection to. If not set, all senses in the vectors will be used.
--exclude-senses, -es option Comma-separated list of senses to exclude. See prodigy_recipes.EVAL_EXCLUDE_SENSES for the defaults.
--n-freq, -f option Number of most frequent entries to limit to.
--n-similar, -n option Number of similar items to check. Defaults to 10.
--batch-size, -b option Batch size to use.
--eval-whole, -E flag Evaluate the whole dataset instead of the current session.
--eval-only, -O flag Don't annotate, only evaluate the current dataset.
--show-scores, -S flag Show all scores for debugging.
prodigy sense2vec.eval-most-similar vectors_eval_sim /path/to/s2v_reddit_2015_md
--senses NOUN,ORG,PRODUCT

recipe sense2vec.eval-ab

Perform an A/B evaluation of two pretrained sense2vec vector models by comparing the most similar entries they return for a random phrase. The UI shows two randomized options with the most similar entries of each model and highlights the phrases that differ. At the end of the annotation session the overall stats and preferred model are shown.

prodigy sense2vec.eval-ab [dataset] [vectors_path_a] [vectors_path_b] [--senses]
[--exclude-senses] [--n-freq] [--n-similar] [--batch-size] [--eval-whole]
[--eval-only] [--show-mapping]
Argument Type Description
dataset positional Dataset to save annotations to.
vectors_path_a positional Path to pretrained sense2vec vectors.
vectors_path_b positional Path to pretrained sense2vec vectors.
--senses, -s option Comma-separated list of senses to limit the selection to. If not set, all senses in the vectors will be used.
--exclude-senses, -es option Comma-separated list of senses to exclude. See prodigy_recipes.EVAL_EXCLUDE_SENSES for the defaults.
--n-freq, -f option Number of most frequent entries to limit to.
--n-similar, -n option Number of similar items to check. Defaults to 10.
--batch-size, -b option Batch size to use.
--eval-whole, -E flag Evaluate the whole dataset instead of the current session.
--eval-only, -O flag Don't annotate, only evaluate the current dataset.
--show-mapping, -S flag Show which models are option 1 and option 2 in the UI (for debugging).
prodigy sense2vec.eval-ab vectors_eval_sim /path/to/s2v_reddit_2015_md /path/to/s2v_reddit_2019_md --senses NOUN,ORG,PRODUCT

UI preview of sense2vec.eval-ab

Pretrained vectors

The pretrained Reddit vectors support the following "senses", either part-of-speech tags or entity labels. For more details, see spaCy's annotation scheme overview.

Tag Description Examples
ADJ adjective big, old, green
ADP adposition in, to, during
ADV adverb very, tomorrow, down, where
AUX auxiliary is, has (done), will (do)
CONJ conjunction and, or, but
DET determiner a, an, the
INTJ interjection psst, ouch, bravo, hello
NOUN noun girl, cat, tree, air, beauty
NUM numeral 1, 2017, one, seventy-seven, MMXIV
PART particle 's, not
PRON pronoun I, you, he, she, myself, somebody
PROPN proper noun Mary, John, London, NATO, HBO
PUNCT punctuation , ? ( )
SCONJ subordinating conjunction if, while, that
SYM symbol $, %, =, :), 😝
VERB verb run, runs, running, eat, ate, eating
Entity Label Description
PERSON People, including fictional.
NORP Nationalities or religious or political groups.
FACILITY Buildings, airports, highways, bridges, etc.
ORG Companies, agencies, institutions, etc.
GPE Countries, cities, states.
LOC Non-GPE locations, mountain ranges, bodies of water.
PRODUCT Objects, vehicles, foods, etc. (Not services.)
EVENT Named hurricanes, battles, wars, sports events, etc.
WORK_OF_ART Titles of books, songs, etc.
LANGUAGE Any named language.

sense2vec's People

Contributors

adrianeboyd, ahalterman, anxo06, cerules, chanind, dasheffie, henningpeters, honnibal, ines, init-random, koaning, mukesh-mehta, shademe, svlandeg, syllog1sm, tolomaus


sense2vec's Issues

Installation Problem ...

That's the error I get:

C:\Python27>py -2.7 -m pip install -e git+git://github.com/spacy-io/sense2vec.git#egg=sense2vec
Obtaining sense2vec from git+git://github.com/spacy-io/sense2vec.git#egg=sense2vec
Updating c:\python27\src\sense2vec clone
Complete output from command python setup.py egg_info:

Error compiling Cython file:
------------------------------------------------------------
...
from libcpp.vector cimport vector
from preshed.maps cimport PreshMap
^
------------------------------------------------------------

vectors.pxd:2:0: 'preshed\maps.pxd' not found
Processing sense2vec\vectors.pyx
Traceback (most recent call last):
  File "C:\Python27\src\sense2vec\bin\cythonize.py", line 199, in <module>
    main()
  File "C:\Python27\src\sense2vec\bin\cythonize.py", line 195, in main
    find_process_files(root_dir)
  File "C:\Python27\src\sense2vec\bin\cythonize.py", line 187, in find_process_files
    process(cur_dir, fromfile, tofile, function, hash_db)
  File "C:\Python27\src\sense2vec\bin\cythonize.py", line 161, in process
    processor_function(fromfile, tofile)
  File "C:\Python27\src\sense2vec\bin\cythonize.py", line 81, in process_pyx
    raise Exception('Cython failed')
Exception: Cython failed
Cythonizing sources
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "C:\Python27\src\sense2vec\setup.py", line 165, in <module>
    setup_package()
  File "C:\Python27\src\sense2vec\setup.py", line 122, in setup_package
    generate_cython(root, src_path)
  File "C:\Python27\src\sense2vec\setup.py", line 63, in generate_cython
    raise RuntimeError('Running cythonize failed')
RuntimeError: Running cythonize failed

----------------------------------------

Command "python setup.py egg_info" failed with error code 1 in C:\Python27\src\sense2vec\

I have tried to install it with Python 3.6, but the same errors occur. I have tried many things, but nothing worked ... I'm sorry, I am not a programmer. I don't understand what the problem is.

Can you please help me?

ImportError: No module named vectors

I've installed sense2vec inside a fresh virtualenv on an Ubuntu 14.04 machine with Python 2.7.6.

pip install -e git+git://github.com/spacy-io/sense2vec.git#egg=sense2vec

The Cython code does not seem to be compiled.

(sense2vec) ~/sense2vec$ python -m sense2vec.download
/home/ubuntu/sense2vec/bin/python: No module named vectors

Here are the installed packages.

(sense2vec) ~/sense2vec$ pip freeze
argparse==1.2.1
cloudpickle==0.2.1
cymem==1.30
murmurhash==0.26.1
numpy==1.10.4
plac==0.9.1
preshed==0.46.2
semver==2.4.1
-e git://github.com/spacy-io/sense2vec.git@e27522f838739f033c048064dfc3077f7a4e956f#egg=sense2vec-master
six==1.10.0
spacy==0.100.6
sputnik==0.9.3
thinc==5.0.6
ujson==1.35
wsgiref==0.1.2

Same error without virtualenv.

thx

How to calculate the similarity between two words

I installed sense2vec successfully and ran model.most_similar(query_vector) without any problem, i.e., I can get the most similar words.

However, when I try to run the two-word similarity example, model.similarity('bacon|NOUN', 'broccoli|NOUN'), it yields the error message:
AttributeError: 'sense2vec.vectors.VectorMap' object has no attribute 'similarity'.

So, what's the right method to calculate the score of two words? Thanks!

Installation of sense2vec: 'sense2vec/vector.cpp'

Hello,

I've been trying to install sense2vec, and although I think I've made some progress, I seem to be stuck with the following error:
fatal error C1083: Cannot open source file: 'sense2vec/vectors.cpp'. The file does exist, though.

I am installing using pip, i.e. 'pip install -e git+git://github.com/spacy-io/sense2vec.git#egg=sense2vec'

Thank you in advance.

Some errors come up when I install the sense2vec.

I installed the spaCy package and downloaded the sense2vec project. When I unzip sense2vec-master.zip and run 'python setup.py install' to install sense2vec, there are some errors.

cadevil@cadevil:~/zrj/sense2vec-master$ python setup.py install
Cythonizing sources
sense2vec/vectors.pyx has not changed
running install
running bdist_egg
running egg_info
writing requirements to sense2vec.egg-info/requires.txt
writing sense2vec.egg-info/PKG-INFO
writing top-level names to sense2vec.egg-info/top_level.txt
writing dependency_links to sense2vec.egg-info/dependency_links.txt
reading manifest file 'sense2vec.egg-info/SOURCES.txt'
writing manifest file 'sense2vec.egg-info/SOURCES.txt'
installing library code to build/bdist.linux-x86_64/egg
running install_lib
running build_py
running build_ext
building 'sense2vec.vectors' extension
gcc -pthread -fno-strict-aliasing -g -O2 -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -I/home/cadevil/anaconda/include/python2.7 -I/home/cadevil/zrj/sense2vec-master/include -I/home/cadevil/anaconda/include/python2.7 -c sense2vec/vectors.cpp -o build/temp.linux-x86_64-2.7/sense2vec/vectors.o -O3 -Wno-unused-function -fopenmp -fno-stack-protector
cc1plus: warning: command line option '-Wstrict-prototypes' is valid for C/ObjC but not for C++
In file included from /home/cadevil/zrj/sense2vec-master/include/numpy/ndarraytypes.h:1804:0,
                 from /home/cadevil/zrj/sense2vec-master/include/numpy/ndarrayobject.h:17,
                 from /home/cadevil/zrj/sense2vec-master/include/numpy/arrayobject.h:4,
                 from sense2vec/vectors.cpp:267:
/home/cadevil/zrj/sense2vec-master/include/numpy/npy_1_7_deprecated_api.h:15:2: warning: #warning "Using deprecated NumPy API, disable it by " "#defining NPY_NO_DEPRECATED_API NPY_1_7_API_VERSION" [-Wcpp]
 #warning "Using deprecated NumPy API, disable it by "
  ^
sense2vec/vectors.cpp: In function 'PyObject* __pyx_pf_9sense2vec_7vectors_11VectorStore_12load(__pyx_obj_9sense2vec_7vectors_VectorStore*, PyObject*)':
sense2vec/vectors.cpp:6769:11: error: cannot declare pointer to 'float&'
   float &*__pyx_v_ptr;
           ^
sense2vec/vectors.cpp: In function 'void __pyx_f_9sense2vec_7vectors_linear_similarity(int*, float*, float*, int, const float*, int, const float* const*, const float*, int, __pyx_t_9sense2vec_7vectors_do_similarity_t)':
sense2vec/vectors.cpp:7323:42: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
   __pyx_t_4 = ((__pyx_v_queue.size() > __pyx_v_nr_out) != 0);
                                      ^
error: command 'gcc' failed with exit status 1

I updated Cython to the newest version and the problem was not solved. The system is Ubuntu and the Python version is 2.7.

Error while using merge_text.py

Hi,

I am getting the following error while pre-processing data with merge_text.py:

Traceback (most recent call last):
  File "C:/Users/Adam/Google Drive/TUe/3 Semester/_thesis/Pycharm/LeadUserIdentification/sense2vec/bin/merge_text.py", line 143, in <module>
    main('a','b', n_workers=1)
  File "C:/Users/Adam/Google Drive/TUe/3 Semester/_thesis/Pycharm/LeadUserIdentification/sense2vec/bin/merge_text.py", line 139, in main
    parallelize(do_work, enumerate(jobs), n_workers, [out_dir])
  File "C:/Users/Adam/Google Drive/TUe/3 Semester/_thesis/Pycharm/LeadUserIdentification/sense2vec/bin/merge_text.py", line 48, in parallelize
    return Parallel(n_jobs=n_jobs)(delayed(func)(*(item + extra)) for item in iterator)
  File "C:\Users\Adam\Anaconda2\lib\site-packages\joblib\parallel.py", line 800, in __call__
    while self.dispatch_one_batch(iterator):
  File "C:\Users\Adam\Anaconda2\lib\site-packages\joblib\parallel.py", line 658, in dispatch_one_batch
    self._dispatch(tasks)
  File "C:\Users\Adam\Anaconda2\lib\site-packages\joblib\parallel.py", line 566, in _dispatch
    job = ImmediateComputeBatch(batch)
  File "C:\Users\Adam\Anaconda2\lib\site-packages\joblib\parallel.py", line 180, in __init__
    self.results = batch()
  File "C:\Users\Adam\Anaconda2\lib\site-packages\joblib\parallel.py", line 72, in __call__
    return [func(*args, **kwargs) for func, args, kwargs in self.items]
  File "C:/Users/Adam/Google Drive/TUe/3 Semester/_thesis/Pycharm/LeadUserIdentification/sense2vec/bin/merge_text.py", line 92, in parse_and_transform
    file_.write(transform_doc(nlp(strip_meta(text))))
  File "C:/Users/Adam/Google Drive/TUe/3 Semester/_thesis/Pycharm/LeadUserIdentification/sense2vec/bin/merge_text.py", line 98, in transform_doc
    for np in doc.noun_chunks:
  File "spacy/tokens/doc.pyx", line 246, in noun_chunks (spacy/tokens/doc.cpp:7745)
    for start, end, label in self.noun_chunks_iterator(self):
  File "spacy/syntax/iterators.pyx", line 11, in english_noun_chunks (spacy/syntax/iterators.cpp:1559)
    word = doc[i]
  File "spacy/tokens/doc.pyx", line 100, in spacy.tokens.doc.Doc.__getitem__ (spacy/tokens/doc.cpp:4853)
    if self._py_tokens[i] is not None:
IndexError: list index out of range

I run merge_text.py with n_workers=1. By the way, I made a minor change to the iter_comments function so it works with plain-text input:

def iter_comments(loc):
    with open(loc) as file_:
        for i, line in enumerate(file_):
            yield line

Do you have an idea why this happens?

Thank you,
Adam

Problem installing

When I try to install via pip install -e git+git://github.com/spacy-io/sense2vec.git#egg=sense2vec
or via pip install sense2vec, I get the following error message:

>>> import sense2vec
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/alex/anaconda2/lib/python2.7/site-packages/sense2vec/__init__.py", line 2, in <module>
    from .vectors import VectorMap
ImportError: /home/alex/anaconda2/lib/python2.7/site-packages/sense2vec/vectors.so: undefined symbol: _ZTINSt8ios_base7failureB5cxx11E

I am using ubuntu 16.04 and python 2.7 with anaconda 4.2.9. (installed requirements-all.txt)
the problem did not occur on a mac sierra 10.12.1

sense2vec (0.6.0) not loading after upgrading spacy to version 1.8.2

sense2vec (0.6.0) is not working with the latest spaCy version 1.8.2.

I’m running Python (2.7.13) Anaconda version (4.3.22) on Ubuntu 14.04.4 LTS

It was working fine with spaCy version 0.101.0, but after upgrading spaCy and its corresponding model I'm unable to load sense2vec. It's throwing ValueError: spacy.strings.StringStore has the wrong size.

If I try to reinstall sense2vec, the spaCy version gets reverted back to 0.101.0.

We need to upgrade spaCy to the latest version for German and Spanish language support, and also need to continue using sense2vec in some of our existing functionality.

Any idea how to resolve the current issue and have spacy (1.8.2) + textacy (0.3.4) + sense2vec (0.6.0) together on my system?

Here is the spaCy information installed on my system, and the error I'm getting when trying to import sense2vec: [screenshots omitted]

Argument Key has the incorrect type

When I enter the following in my Python terminal:

freq, query_vector = model["natural_language_processing|NOUN"]

I get this error. I am just trying to use the basic model. [screenshots of the error omitted]

Similarity using Sense2Vec along with Spacy

I am not able to apply the similarity examples from the blog or issue #24 to my case. I just want to get the "sense" of "bear", i.e. homonyms, in an example like "The bear growled. She could not bear the pain."

import spacy
nlp = spacy.load("en_core_web_md")
from sense2vec import Sense2VecComponent
text = "The bear growled. She could not bear the pain."
s2v = Sense2VecComponent('./reddit_vectors-1.1.0')
nlp.add_pipe(s2v)
doc = nlp(text)

After this, I can't find any official reference for a similarity API that takes two vectors. I have tried the following, but I get errors:

s2v._.similarity(doc[1]._.s2v_vec,doc[6]._.s2v_vec)
nlp._.similarity(doc[1]._.s2v_vec,doc[6]._.s2v_vec)

Unable to load model

When I tried to load the model via "model = sense2vec.load()" I get the following error:

RuntimeError("Model not installed. Please run 'python -m "
RuntimeError: Model not installed. Please run 'python -m sense2vec.download' to install latest compatible model.

Then I tried to execute the command 'python -m sense2vec.download' and I got another error:

File "C:\Users\rg\Anaconda2\lib\runpy.py", line 174, in _run_module_as_main
"main", fname, loader, pkg_name)
File "C:\Users\rg\Anaconda2\lib\runpy.py", line 72, in run_code
exec code in run_globals
File "c:\users\rg\src\sense2vec\sense2vec\download.py", line 38, in
plac.call(main)
File "C:\Users\rg\Anaconda2\lib\site-packages\plac_core.py", line 328, in call
cmd, result = parser.consume(arglist)
File "C:\Users\rg\Anaconda2\lib\site-packages\plac_core.py", line 207, in consume
return cmd, self.func(
(args + varargs + extraopts), *_kwargs)
File "c:\users\rg\src\sense2vec\sense2vec\download.py", line 20, in main
sputnik.package(about.title, about.version, about.default_model)
AttributeError: 'module' object has no attribute 'title'

Can you please help me?

Error when loading reddit_vectors-1.1.0

Adding to existing Jupyter notebook that is run from a docker container, I do the following:

import sense2vec
s2v = sense2vec.load('reddit_vectors-1.1.0')

which throws the error:

/usr/local/lib/python3.5/dist-packages/sense2vec/util.py in get_package_by_name(name, via)
     18                                name or about.default_model, data_path=via)
     19     except PackageNotFoundException as e:
---> 20         raise RuntimeError("Model not installed. Please run 'python -m "
     21                            "sense2vec.download' to install latest compatible "
     22                            "model.")

Model not installed. Please run 'python -m sense2vec.download' to install latest compatible model.

I then create a local image of the container using a TensorFlow Dockerfile, and in the Dockerfile I issue RUN pip install sense2vec when building my Docker container. But if I do a RUN ipython -m sense2vec.download, I get AttributeError: module 'sense2vec.about' has no attribute '__title__', raised from the line print("Model already installed. Please run '%s --force to reinstall." % sys.argv[0], file=sys.stderr).

The same happens if I SSH into the Docker container, that is, if I do an ipython -m sense2vec.download from the command line I get the same error from that print statement.

The iPython version is 3.5. Docker container's OS version is 64-bit: SMP Wed Mar 14 15:12:16 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

I just noticed the requirement of CPython. Since Jupyter uses IPython, I suppose there is no way to get this to work? Even if so, why the discrepancy in errors between outside the container and inside it?

Please advise.

Potential Issues with the same Term being tagged with two "senses"

I was doing some casual testing of similarity and found something that seems to be a bug or at least an inconsistency that could be problematic for people.

It looks like the Entity tagging for the original training set behaves slightly differently than the tagging used for on-the-fly similarity comparisons. The issue is primarily focused around names of people. So, for at least a few different people ("Quentin Tarantino", "Dan Harmon", etc) they appear within the vectors table tagged as a PERSON. However, when parsing documents on the fly with the model and then looking for similarity, the keys aren't found. This is because they are being compared using a differently calculated "sense". They both show up within the target doc tagged as PROPN.

So, here you see that Dan_Harmon|PERSON is returned as semantically related to the phrase "writers_room" - which is perfectly legit and makes a ton of sense.

[screenshot of the most_similar results omitted]

However, if I try to parse the sentence: "Dan Harmon is one of my favorite writers" or "I really like Dan Harmon as a writer", it yields a Key Error with ...

[screenshot of the KeyError omitted]

There are probably a few ways to change this around (maybe by pre-emptively changing out PROPN tokens to PERSON ? - but that feels like it will almost certainly backfire).

Let me know your thoughts.

I'm trying to work with the reddit_vectors. Can't find it after install.

Hello, anybody?
I'm having trouble using the reddit vectors after downloading.
I've downloaded them, but I can't get the vectors to load in the example code.

>>> s2v = Sense2VecComponent('./anaconda3/envs/fastai/lib/python3.6/site-packages/sense2vec/data/reddit_vectors-1.1.0/vectors.bin')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "./anaconda3/envs/fastai/lib/python3.6/site-packages/sense2vec/__init__.py", line 32, in __init__
    self.s2v = load(vectors_path)
  File "./anaconda3/envs/fastai/lib/python3.6/site-packages/sense2vec/__init__.py", line 10, in load
    vector_map.load(vectors_path)
  File "vectors.pyx", line 208, in sense2vec.vectors.VectorMap.load
  File "vectors.pyx", line 306, in sense2vec.vectors.VectorStore.load
  File "cfile.pyx", line 13, in sense2vec.cfile.CFile.__init__
OSError: Could not open binary file b'./anaconda3/envs/fastai/lib/python3.6/site-packages/sense2vec/data/reddit_vectors-1.1.0/vectors.bin/vectors.bin'

If I do a search for reddit vectors this is what I get.

$ locate reddit_vectors-1.1.0
./anaconda3/envs/fastai/lib/python3.6/site-packages/sense2vec/data/reddit_vectors-1.1.0
./anaconda3/envs/fastai/lib/python3.6/site-packages/sense2vec/data/__cache__/reddit_vectors-1.1.0
./anaconda3/envs/fastai/lib/python3.6/site-packages/sense2vec/data/__cache__/reddit_vectors-1.1.0/archive.gz
./anaconda3/envs/fastai/lib/python3.6/site-packages/sense2vec/data/__cache__/reddit_vectors-1.1.0/meta.json
./anaconda3/envs/fastai/lib/python3.6/site-packages/sense2vec/data/reddit_vectors-1.1.0/freqs.json
./anaconda3/envs/fastai/lib/python3.6/site-packages/sense2vec/data/reddit_vectors-1.1.0/meta.json
./anaconda3/envs/fastai/lib/python3.6/site-packages/sense2vec/data/reddit_vectors-1.1.0/strings.json
./anaconda3/envs/fastai/lib/python3.6/site-packages/sense2vec/data/reddit_vectors-1.1.0/vectors.bin

Blog post question

For the first code snippet in this link:

https://explosion.ai/blog/sense2vec-with-spacy

def transform_texts(texts):
    # Load the annotation models
    nlp = English()
    # Stream texts through the models. We accumulate a buffer and release
    # the GIL around the parser, for efficient multi-threading.
    for doc in nlp.pipe(texts, n_threads=4):
        # Iterate over base NPs, e.g. "all their good ideas"
        for np in doc.noun_chunks:
            # Only keep adjectives and nouns, e.g. "good ideas"
            while len(np) > 1 and np[0].dep_ not in ('amod', 'compound'):
                np = np[1:]
            if len(np) > 1:
                # Merge the tokens, e.g. good_ideas
                np.merge(np.root.tag_, np.text, np.root.ent_type_)
            # Iterate over named entities
            for ent in doc.ents:
                if len(ent) > 1:
                    # Merge them into single tokens
                    ent.merge(ent.root.tag_, ent.text, ent.label_)
        token_strings = []
        for token in tokens:
            text = token.text.replace(' ', '_')
            tag = token.ent_type_ or token.pos_
            token_strings.append('%s|%s' % (text, tag))
        yield ' '.join(token_strings)

where is the "tokens" variable defined (from the for loop)?

sense2vec drags spacy to older version?

I'm wanting to use the reddit vectors for analogies. When I install sense2vec, I run into the old issue that lexemes are unhashable. I saw that this was fixed in September of 2016, so I updated spacy, but then sense2vec complains:

spacy.strings.StringStore has the wrong size, try recompiling

Recompiling leads to unhashable lexemes again...

Am I right that the sense2vec installation uses an earlier version of spacy, or am I just clueless? Any ideas for a workaround?

"Can't run sense2vec: document not tagged" when using nlp.pipe()

I just installed sense2vec from pip (v1.0.0a0), and I wanted to use s2v with spaCy's nlp pipeline. However, when I use the pipe, the script fails and throws this error:

Traceback (most recent call last):
  File "text_extract.py", line 29, in <module>
    for row, doc in enumerate(nlp.pipe(texts, n_threads=8, batch_size=100)):
  File "/usr/local/lib/python3.5/dist-packages/spacy/language.py", line 578, in pipe
    for doc in docs:
  File "/usr/local/lib/python3.5/dist-packages/spacy/language.py", line 753, in _pipe
    doc = func(doc)
  File "/usr/local/lib/python3.5/dist-packages/sense2vec/__init__.py", line 40, in __call__
    raise ValueError("Can't run sense2vec: document not tagged.")
ValueError: Can't run sense2vec: document not tagged.

I noticed that here you commented out the two lines at lines 23-24:

    #if not doc.is_tagged:
    #    raise ValueError("Can't run sense2vec: document not tagged.")

Once I did the same in my version, I was able to successfully use the pipeline. Perhaps all this issue needs is a readme change? It looks like your current version on github fixes this problem, but the suggested pip install breaks when using nlp.pipe().

Segmentation Fault with sense2vec loading

I successfully installed sense2vec with python2.7 on Ubuntu.
I ran python -m sense2vec.download.
I successfully loaded the model multiple times with the following code:

>>import sense2vec
>>model = sense2vec.load()
>>

but now loading sense2vec creates a segmentation fault (core dumped):

>>import sense2vec
>>model = sense2vec.load()
Segmentation Fault

I force-downloaded the model again with download.py --force, and I'm still stuck with the segmentation fault. Any idea how to solve it?

Unable to download reddit_vectors model

Hi @honnibal ,

I am getting the following error when I execute:
$ python -m sense2vec.download

File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/runpy.py", line 184, in _run_module_as_main
"main", mod_spec)
File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/runpy.py", line 85, in run_code
exec(code, run_globals)
File "/Users/boscoraju/src/sense2vec/sense2vec/download.py", line 38, in
plac.call(main)
File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/plac_core.py", line 328, in call
cmd, result = parser.consume(arglist)
File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/plac_core.py", line 207, in consume
return cmd, self.func(
(args + varargs + extraopts), *_kwargs)
File "/Users/boscoraju/src/sense2vec/sense2vec/download.py", line 26, in main
package = sputnik.install(about.title, about.version, about.default_model)
File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/sputnik/init.py", line 37, in install
index.update()
File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/sputnik/index.py", line 84, in update
index = json.load(session.open(request, 'utf8'))
File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/sputnik/session.py", line 43, in open
r = self.opener.open(request)
File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/urllib/request.py", line 466, in open
response = self._open(req, data)
File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/urllib/request.py", line 484, in _open
'_open', req)
File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/urllib/request.py", line 444, in _call_chain
result = func(*args)
File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/urllib/request.py", line 1297, in https_open
context=self._context, check_hostname=self._check_hostname)
File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/urllib/request.py", line 1256, in do_open
raise URLError(err)
urllib.error.URLError: <urlopen error [SSL: SSLV3_ALERT_HANDSHAKE_FAILURE] sslv3 alert handshake failure (_ssl.c:645)>

Looks like it is unable to make a connection. Could you point me to right direction?

Thank you.

Support for Non-English languages

Google's Syntaxnet has released pre-trained models for 40 other languages.

May I know if any of these can be used (with Spacy's Sense2Vec) to train word embeddings in languages other than English?

Thanks
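
For reference, the general training recipe should carry over to any language that has a POS-capable tagger. A minimal sketch, assuming gensim 4.x and that a spaCy model exists for the target language (the model and file names below are placeholders):

    # Tag a non-English corpus and train word2vec on word|POS keys.
    import spacy
    from gensim.models import Word2Vec

    nlp = spacy.load("de_core_news_sm")  # placeholder model name
    sentences = []
    with open("corpus.txt", encoding="utf8") as f:
        for doc in nlp.pipe(f):
            sentences.append([f"{t.text}|{t.pos_}" for t in doc if not t.is_space])

    # Plain word2vec over the sense-keyed tokens (gensim 4.x API).
    model = Word2Vec(sentences, vector_size=128, window=5, min_count=10, workers=4)
    model.save("sense2vec_nonen.model")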

Error compiling Cython file: Cython failed

  • pip install -r requirements.txt : All requirements are satisfied

  • pip install -e .

Obtaining file:///opt/sense2vec
Running setup.py (path:/opt/sense2vec/setup.py) egg_info for package from file:///opt/sense2vec


Error compiling Cython file:
    ------------------------------------------------------------
    ...
    from libcpp.vector cimport vector
    from preshed.maps cimport PreshMap
    ^
    ------------------------------------------------------------
    
    vectors.pxd:2:0: 'preshed/maps.pxd' not found
    Processing sense2vec/vectors.pyx
    Traceback (most recent call last):
      File "/opt/sense2vec/bin/cythonize.py", line 199, in <module>
        main()
      File "/opt/sense2vec/bin/cythonize.py", line 195, in main
        find_process_files(root_dir)
      File "/opt/sense2vec/bin/cythonize.py", line 187, in find_process_files
        process(cur_dir, fromfile, tofile, function, hash_db)
      File "/opt/sense2vec/bin/cythonize.py", line 161, in process
        processor_function(fromfile, tofile)
      File "/opt/sense2vec/bin/cythonize.py", line 72, in process_pyx
        raise Exception('Cython failed')
    Exception: Cython failed
    Cythonizing sources
    Traceback (most recent call last):
      File "<string>", line 17, in <module>
      File "/opt/sense2vec/setup.py", line 165, in <module>
        setup_package()
      File "/opt/sense2vec/setup.py", line 122, in setup_package
        generate_cython(root, src_path)
      File "/opt/sense2vec/setup.py", line 63, in generate_cython
        raise RuntimeError('Running cythonize failed')
    RuntimeError: Running cythonize failed

Default POS while querying

I've seen in the demo that queries can be made without specifying a POS tag, defaulting to an "auto" sense. Is there a way to replicate this when querying with this model too? Or a way to query phrases like the "fair_game" example in the demo?
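
The newer standalone API can approximate the demo's "auto" behaviour; a minimal sketch, assuming the v2 Sense2Vec.get_best_sense helper, which picks the best-matching (highest-frequency) sense for a bare word:

    from sense2vec import Sense2Vec

    s2v = Sense2Vec().from_disk("/path/to/s2v_reddit_2015_md")  # placeholder path
    best = s2v.get_best_sense("fair_game")  # e.g. "fair_game|NOUN"
    if best is not None:
        print(s2v.most_similar(best, n=3))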

BufferError: Object is not writable.

Hello,

I'm getting this error back from Cython after just trying the example in the README. Could there be a version mismatch or something? I believe I installed according to the README as well.

Thanks!

Help loading model

I downloaded the trained model from:

https://index.spacy.io/models/reddit_vectors-1.0.1/archive.gz

How can I load this into a VectorMap or a gensim model in order to make similarity queries?
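
A minimal sketch with the old VectorMap API used elsewhere in these issues, assuming the Reddit vectors' 128 dimensions and that the path points to the extracted archive directory:

    from sense2vec.vectors import VectorMap

    vector_map = VectorMap(128)
    vector_map.load("/path/to/reddit_vectors-1.0.1")  # extracted directory
    freq, vector = vector_map["natural_language_processing|NOUN"]
    words, scores = vector_map.most_similar(vector, 10)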

Sense2vec and spaCy: how to choose the "sense", i.e. POS or entity labels

Using sense2vec in conjunction with spaCy, is there a way to choose the part-of-speech tag / entity label for a token when the attribute s2v_most_similar is applied?

E.g. for the token "duck", the default sense/POS is NOUN when the attribute s2v_most_similar is applied.
Using spaCy with sense2vec, is there a way to get s2v_most_similar for "duck" as a VERB?

Thanks!
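
Until the component exposes an option for this, one workaround is to build the key yourself and query the standalone model directly (s2v below stands for the loaded sense2vec table); a minimal sketch, assuming keys follow the word|POS scheme of the pretrained vectors:

    # Query an explicit sense instead of the token's default POS.
    query = "duck|VERB"
    if query in s2v:
        print(s2v.most_similar(query, n=5))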

pip install sense2vec==1.0.0a0 fails with "Failed building wheel" / wrong version of spaCy

I went about trying to install sense2vec through pip with pip install sense2vec==1.0.0a0 but ended up with a lot of output to stdout. In it, I see four errors:

'PyThreadState' {aka 'struct _ts'} has no member named 'exc_type'; did you mean 'curexc_type'?
Failed building wheel for spacy
error: command 'gcc' failed with exit status 1
Failed building wheel for thinc

I can't figure out exactly what is causing the problem. The output includes a line stating that my version of spaCy is out of date, yet pip shows I'm using spaCy 2.0. Package/OS specs are as follows, and attached is a text file of the output:
sense2vec.txt

$ pip show spacy
Name: spacy
Version: 2.0.11
Summary: Industrial-strength Natural Language Processing (NLP) with Python and Cython
Home-page: https://spacy.io
Author: Explosion AI
Author-email: [email protected]
License: MIT
Location: /home/***/anaconda2/envs/ask/lib/python3.7/site-packages
Requires: numpy, murmurhash, cymem, preshed, thinc, plac, pathlib, ujson, dill, regex
Required-by: en-core-web-sm

$ python --version
Python 3.7.0

$ uname -r
4.10.0-42-generic

$ gcc --version
gcc (Ubuntu 6.3.0-12ubuntu2) 6.3.0 20170406
Copyright (C) 2016 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

I'm not able to use the '.add_pipe' attribute

>>> nlp.add_pipe(s2v)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'English' object has no attribute 'add_pipe'

Is it called something else now?

Error when trying to open downloaded Reddit vectors


The code right now is super-simple:

from sense2vec import Sense2VecComponent
import spacy

nlp = spacy.load('en_core_web_sm')
s2v = Sense2VecComponent('./reddit-vectors-1.1.0')
nlp.add_pipe(s2v)

Fails with "Could not open binary file", but the file is definitely there. Installed into its own venv with this sense2vec dev branch + spaCy + a newly downloaded en_core_web_sm.

Fails to install

Environment:
python: 2.7.12
I used anaconda to create a clean environment and followed these steps:

  1. Clone the repository
  2. Run pip install -r requirements.txt
  3. Run pip install -e .

Then I got the following error:
sense2vec/vectors.cpp:8061:10: error: '__pyx_v_ptr' declared as a pointer to a reference of type 'float &'
float &*__pyx_v_ptr;
^
1 warning and 1 error generated.
error: command 'gcc' failed with exit status 1

Could someone provide a solution?

Thanks,

Sense2vec Similarity Question

Why do 'flies|VERB' and 'flies|NOUN' have a similarity of 1.0?
I'm running sense2vec in Anaconda, with Python 3.6 on OS X 10.11.6.

$ python --version
Python 3.6.3 :: Anaconda custom (64-bit)
$ sputnik --name sense2vec --repository-url http://index.spacy.io install reddit_vectors
Downloading...
Downloaded 560.90MB 100.00% 2.15MB/s eta 0s              
archive.gz checksum/md5 OK
INFO:sputnik.pool:install reddit_vectors-1.1.0
$ conda list spacy
# packages in environment at /Users/davidlaxer/anaconda/envs/spacy:
#
spacy                     2.0.4                    py36_0    conda-forge
spacy                     0.101.0                   <pip>
$ conda list sense2vec
# packages in environment at /Users/davidlaxer/anaconda/envs/spacy:
#
sense2vec                 0.6.0                     <pip>
$ conda list thinc
# packages in environment at /Users/davidlaxer/anaconda/envs/spacy:
#
thinc                     6.10.0                   py36_0    conda-forge
thinc                     5.0.8                     <pip>

Here's my example:

import sense2vec
model = sense2vec.load()
freq, query_vector1 = model["flies|NOUN"]
model.most_similar(query_vector1, n=5)
(['flies|NOUN', 'gnats|NOUN', 'snakes|NOUN', 'birds|NOUN',  'grasshoppers|NOUN'],
 <MemoryView of 'ndarray' at 0x1af394c540>)

freq, query_vector2 = model["flies|VERB"]
model.most_similar(query_vector2, n=5)

(['flies|VERB', 'flys|VERB', 'flying|VERB', 'jumps|VERB', 'swoops|VERB'],
 <MemoryView of 'ndarray' at 0x1af394c6e8>)
In [42]: model.data.similarity(query_vector1, query_vector1)
1.0
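
Note that the call above compares query_vector1 with itself, which is 1.0 by definition. A cross-check of the two different senses with plain numpy might look like this (a sketch, assuming both vectors are numpy-compatible float32 memoryviews):

    import numpy as np

    v1 = np.asarray(query_vector1, dtype="float32")  # flies|NOUN
    v2 = np.asarray(query_vector2, dtype="float32")  # flies|VERB
    cosine = v1.dot(v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
    print(cosine)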


From a model I trained:

In [40]: new_model = gensim.models.Word2Vec.load('/Users/davidlaxer/LSTM-Sentiment-Analysis/corpus_output_256.txt')
In [41]: new_model.similarity('flies|NOUN', 'flies|VERB')
0.9954307438574328
In [43]: new_model.wv.vocab["flies|VERB"].index
5895
In [44]: new_model.wv.vocab["flies|NOUN"].index
7349
In [45]: new_model.wv["flies|VERB"]
array([ 0.15279259,  0.04471067,  0.0923325 , -0.07349139,  0.04180749,
     -0.71864516,  0.08252977, -0.02405624,  0.28384277,  0.01706951,
     -0.15931296, -0.21216595, -0.0352594 ,  0.13597694,  0.07868216,
     -0.15907238, -0.30132023,  0.01954124,  0.22636545, -0.19983807,
     -0.03842518,  0.49959993, -0.18679027, -0.16045345,  0.05813084,
      0.12905809,  0.1305625 ,  0.42689237,  0.19311258, -0.1002808 ,
      0.07427863, -0.19840011,  0.42542475, -0.32158205,  0.15129171,
     -0.32177079, -0.04034998, -0.05301504,  0.38441092, -0.31020632,
      0.42528978, -0.26249531, -0.25648555,  0.16558036,  0.28656447,
     -0.11909373,  0.09208378, -0.08886475, -0.40061441,  0.02873728,
      0.07275984, -0.05674595, -0.09471942, -0.01308586, -0.2777423 ,
     -0.05253473, -0.00179329, -0.15887854,  0.31784746, -0.00895729,
      0.50658983,  0.09232203,  0.16289137, -0.20241632, -0.01240843,
      0.20972176,  0.065593  ,  0.40676439, -0.16795945,  0.08079262,
      0.27334401,  0.16058736, -0.15362383, -0.13958427,  0.17041191,
     -0.08574789, -0.20200305,  0.16288304,  0.11220794,  0.44721738,
     -0.14058201,  0.13652138, -0.0134679 ,  0.20938247,  0.34156594,
      0.21730828, -0.19907214,  0.02451441,  0.12492239,  0.08635994,
     -0.29003018,  0.01458945,  0.02637799,  0.10671763, -0.17983682,
      0.01115436, -0.02827467,  0.13415532,  0.4656623 , -0.34222263,
      0.44238791, -0.29407004, -0.16681372,  0.04466435, -0.21825369,
     -0.09138768,  0.02407285, -0.57841706, -0.19544049, -0.07518575,
      0.36430466, -0.13164517, -0.01708322,  0.11068137,  0.2811991 ,
      0.02544841,  0.10672008,  0.06147943,  0.09167367, -0.71296901,
      0.04190712, -0.47360554, -0.01762259,  0.0359503 , -0.24351278,
     -0.01718491, -0.04033662,  0.03032484, -0.33736056, -0.13555804,
      0.02156358, -0.50073934, -0.0706998 ,  0.41698509, -0.23886077,
     -0.06120266, -0.0681426 ,  0.15182504,  0.13283113, -0.05899575,
     -0.11477304, -0.18594885, -0.17855589,  0.31381837,  0.25157636,
      0.41943148,  0.05070408, -0.03173119, -0.04240219, -0.25305411,
     -0.36856946,  0.20292452,  0.10858628,  0.17122397,  0.01447193,
     -0.47961271, -0.45739996,  0.17185016, -0.03916142, -0.04544915,
      0.34947339,  0.04178765,  0.37088165,  0.14284173,  0.03443905,
      0.30170318,  0.05259432, -0.22402297,  0.05495254, -0.46103877,
     -0.22059456, -0.27414244,  0.55484813,  0.1569699 ,  0.35761088,
      0.08712664,  0.23313828, -0.25803107, -0.03343969, -0.14713305,
     -0.0611255 ,  0.17435439, -0.01603068,  0.00526717, -0.08379596,
     -0.08644171, -0.12666632,  0.12955435,  0.48045933, -0.17596652,
     -0.29505005,  0.60152525, -0.01975689,  0.02343576,  0.17027852,
     -0.06638149, -0.10826188, -0.41277543, -0.12114278, -0.01596882,
      0.02660148,  0.22383556, -0.030263  , -0.0768819 , -0.32506746,
     -0.15082234, -0.16559191, -0.08502773, -0.01570902, -0.22921689,
      0.19637343, -0.4993245 ,  0.19670881,  0.17284806,  0.10345648,
      0.45276237, -0.12255403,  0.18032061,  0.05677452,  0.09869532,
     -0.23536956, -0.22449525,  0.51938456,  0.24111946,  0.26022053,
     -0.18190917, -0.01768251,  0.00435291,  0.05820792, -0.46525213,
      0.17490779,  0.15250422, -0.1760795 ,  0.14194083,  0.09954269,
     -0.89346975, -0.11642933,  0.0944154 ,  0.2134015 , -0.01955901,
     -0.02899018,  0.07254739, -0.03995875,  0.39499217, -0.05394226,
     -0.07821836, -0.29973337, -0.11607374, -0.01082127,  0.36769736,
      0.04288069, -0.0461933 ,  0.00675509,  0.25210902, -0.21784271,
     -0.18479778], dtype=float32)
In [46]: new_model.wv["flies|NOUN"]
array([ 0.1304135 ,  0.05724983,  0.06886293, -0.03062466,  0.01640639,
     -0.53799176,  0.10968599, -0.02839088,  0.18814373,  0.00147691,
     -0.11227507, -0.14502132, -0.03685957,  0.06422875,  0.07289967,
     -0.10437401, -0.23557086,  0.00153201,  0.17661473, -0.12828164,
     -0.02789859,  0.35942602, -0.1580196 , -0.13264264,  0.03343309,
      0.10922851,  0.1102568 ,  0.29480889,  0.14417146, -0.07892705,
      0.06608826, -0.14885685,  0.32329369, -0.23263605,  0.11967299,
     -0.23964159, -0.02619613,  0.00930338,  0.31111386, -0.22507732,
      0.32475442, -0.19287167, -0.19306417,  0.10722513,  0.2237518 ,
     -0.06828826,  0.07246322, -0.06233693, -0.31375739,  0.01069155,
      0.04457425, -0.00323939, -0.05079295, -0.02164256, -0.22060572,
     -0.03816675,  0.00503534, -0.10069088,  0.24429323,  0.02505454,
      0.38344654,  0.09145252,  0.11439045, -0.10801487, -0.01075712,
      0.16894275,  0.04799445,  0.3149668 , -0.13885498,  0.02068597,
      0.17856079,  0.11587915, -0.11973458, -0.0896498 ,  0.11993878,
     -0.06647626, -0.15219077,  0.10705566,  0.07842658,  0.31101131,
     -0.12788543,  0.09909476,  0.00878725,  0.1618593 ,  0.22566552,
      0.1297064 , -0.14370884,  0.02069237,  0.08489513,  0.0567583 ,
     -0.21860926,  0.01057386,  0.03844477,  0.06213358, -0.12877114,
      0.02327059, -0.00917741,  0.11733869,  0.35853127, -0.25572705,
      0.30879059, -0.20568153, -0.12405248,  0.03546307, -0.18377842,
     -0.06700096,  0.00626029, -0.42848313, -0.13129929, -0.04215423,
      0.26977378, -0.07725398,  0.01177794,  0.05952175,  0.21516307,
      0.01055368,  0.06727242,  0.05038245,  0.06739338, -0.53844106,
      0.02834721, -0.33890292, -0.02644366,  0.03540507, -0.16382404,
     -0.01353777, -0.02502321,  0.00226415, -0.24348356, -0.12502551,
      0.01489578, -0.37660655, -0.05798845,  0.28748602, -0.18512824,
     -0.06250153, -0.06967189,  0.14023623,  0.09628384, -0.09925015,
     -0.07317897, -0.14045765, -0.14597888,  0.24456802,  0.173549  ,
      0.3357946 ,  0.0424754 ,  0.00723427, -0.02120454, -0.14892557,
     -0.26496273,  0.14844348,  0.06555442,  0.11951103,  0.03691757,
     -0.36404395, -0.32292312,  0.09412326, -0.06377046, -0.02561374,
      0.24361259,  0.02616721,  0.29151902,  0.1178301 ,  0.03284379,
      0.20218852,  0.0337379 , -0.14703217,  0.02869225, -0.31447497,
     -0.15038867, -0.23353554,  0.41700551,  0.11959957,  0.26917797,
      0.04590914,  0.16029988, -0.18795538, -0.01343729, -0.10532234,
     -0.02617499,  0.12019841,  0.00673278, -0.0070972 , -0.03176219,
     -0.07582191, -0.07277017,  0.09928112,  0.36159652, -0.14404564,
     -0.21233276,  0.46463615,  0.01645906,  0.01815237,  0.12149289,
     -0.07040837, -0.06278557, -0.29605272, -0.07451538,  0.00487611,
      0.00313085,  0.13640559, -0.02045129, -0.05790693, -0.22582445,
     -0.10382047, -0.13318184, -0.05160375,  0.01498237, -0.15075362,
      0.14116266, -0.36445442,  0.1420894 ,  0.11182524,  0.10055254,
      0.33450282, -0.08930281,  0.15410167,  0.03961684,  0.06431124,
     -0.15608449, -0.1599745 ,  0.3780185 ,  0.18073064,  0.2190931 ,
     -0.16039631, -0.03769958, -0.00069833,  0.06914425, -0.33746576,
      0.11075038,  0.11626988, -0.12498619,  0.07928085,  0.0636186 ,
     -0.6352759 , -0.10650127,  0.03810085,  0.14585988, -0.01552053,
     -0.01488287,  0.04300846, -0.00500007,  0.26444513, -0.03629581,
     -0.04127173, -0.23304868, -0.08911316,  0.0029219 ,  0.27401808,
      0.00279731, -0.04162024,  0.00214672,  0.15316918, -0.14298579,
     -0.15343791], dtype=float32)


Assertion error loading vector map

The model was trained with Gensim 1.0 and then converted into sense2vec format with gensim2sense.py (after adding wv so it would work with the new version of Gensim). After generating the three files freqs.json, vectors.bin and strings.json, loading the model with VectorMap fails as follows:

File "sense2vec/vectors.pyx", line 208, in sense2vec.vectors.VectorMap.load (sense2vec/vectors.cpp:6016)
File "sense2vec/vectors.pyx", line 319, in sense2vec.vectors.VectorStore.load (sense2vec/vectors.cpp:8554)
File "sense2vec/vectors.pyx", line 238, in sense2vec.vectors.VectorStore.add (sense2vec/vectors.cpp:7182)
AssertionError

This worked before, when the model was trained with a much older version of Gensim (I can't remember which). Help is appreciated!

sense2vec and spaCy

I get an error when I try to import sense2vec after I import spacy (v2.0.11):

import sense2vec

File "/usr/local/lib/python3.5/dist-packages/sense2vec/init.py", line 2, in
from .vectors import VectorMap
File ".env/lib/python2.7/site-packages/spacy/strings.pxd", line 18, in init sense2vec.vectors (sense2vec/vectors.cpp:26598)
ValueError: spacy.strings.StringStore has the wrong size, try recompiling

I am using Python 3; is that the issue?

Error using the most_similar method

Following the successful installation of sense2vec, I loaded the model as described in the response to issue #3, but I get an error when I try to use the most_similar method.

Following is what I entered after loading the model:
print vector_map.most_similar("education", topn=10)

Below is the error I receive.

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-9-7f468f5b06ca> in <module>()
----> 1 print vector_map.most_similar("education", topn=10)

/home/noname/spacy/src/sense2vec/sense2vec/vectors.pyx in sense2vec.vectors.VectorMap.most_similar (sense2vec/vectors.cpp:3363)()
     66             yield (string, freq, self.data[i])
     67 
---> 68     def most_similar(self, float[:] vector, int n):
     69         indices, scores = self.data.most_similar(vector, n)
     70         return [self.strings[idx] for idx in indices], scores

TypeError: most_similar() takes exactly 2 positional arguments (1 given)

So I understand that the most_similar method wants a float vector followed by an int. I thought the function would expect arguments similar to gensim's word2vec implementation of most_similar.

Could you please show me how to use the most_similar method in the sense2vec implementation?
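
Going by the signature in the traceback, the old API wants a raw vector plus a count, so the lookup has to happen first; a minimal sketch, assuming keys are stored as word|POS as in the pretrained Reddit vectors:

    # Look up the (freq, vector) pair, then pass the vector and n.
    freq, vector = vector_map["education|NOUN"]
    words, scores = vector_map.most_similar(vector, 10)
    print(list(zip(words, scores)))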

Incompatible spaCy model when using sense2vec

Installing sense2vec rolls back spacy version to 0.101.0 as documented here: #25

However, none of the current English spaCy models are compatible with 0.101.0, and they raise this error when trying to load:

super(Package, self).__init__(defaults=meta['package']) KeyError: 'package'

Is there a way to download old spaCy models? I mainly want to use spaCy to get POS and entity tags (NOUN, GPE, etc.), which can then be passed to sense2vec.

Thank you!

most_similar method

I would like to do operations of the type trained_model.most_similar(positive=['woman', 'king'], negative=['man']) = [('queen', 0.50882536), ...]; however, sense2vec's most_similar does not accept positive and negative parameters. Has anyone had to implement this? Is it planned for sense2vec?
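
Analogy queries aren't built in, but since the old API's most_similar accepts a raw vector, the arithmetic can be done by hand; a minimal sketch, assuming word|POS keys and numpy-compatible vectors (model stands for a loaded sense2vec model, as in the other issues):

    import numpy as np

    _, king = model["king|NOUN"]
    _, man = model["man|NOUN"]
    _, woman = model["woman|NOUN"]

    # king - man + woman, normalized, then a nearest-neighbour query.
    query = np.asarray(king) - np.asarray(man) + np.asarray(woman)
    query /= np.linalg.norm(query)
    words, scores = model.most_similar(query.astype("float32"), 5)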

On installation: UnicodeDecodeError.

I tried to install sense2vec with Python 3.6.5 (64-bit) and encountered the following encoding error:

py -3 -m pip install sense2vec==1.0.0a0
Collecting sense2vec==1.0.0a0
Using cached https://files.pythonhosted.org/packages/28/4a/a1d9a28545adc839789c1442e7314cb0c70b8657a885f9e5b287fade7814/sense2vec-1.0.0a0.tar.gz
Complete output from command python setup.py egg_info:
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "C:\Users\MyUser\AppData\Local\Temp\pip-install-21y0wglc\sense2vec\setup.py", line 169, in <module>
    setup_package()
  File "C:\Users\MyUser\AppData\Local\Temp\pip-install-21y0wglc\sense2vec\setup.py", line 107, in setup_package
    readme = f.read()
  File "C:\Program Files\Python64\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 13089: character maps to <undefined>

Any quick ideas on how to fix that?
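
The traceback shows setup.py reading the README with Windows' default cp1252 codec, so the usual fix is to open the file with an explicit encoding; a minimal sketch of the relevant change (io.open works on both Python 2 and 3):

    import io

    # Read the README as UTF-8 regardless of the platform's locale.
    with io.open("README.md", encoding="utf8") as f:
        readme = f.read()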

Error while opening own trained vectors file

I was able to train vectors using train_word2vec.py after preprocessing the data with merge_text.py.
Below is the outcome of train_word2vec.py:

(screenshot of the output vector files)

Then I loaded the vectors.bin into the new version 0.2.0 of sense2vec and got an IOError. This is what I ran to load the vectors:

from sense2vec.vectors import VectorMap
vector_map = VectorMap(128)
vector_map.load("/home/noname/Documents/data/vectors")

The error:

---------------------------------------------------------------------------
IOError                                   Traceback (most recent call last)
<ipython-input-9-315510f2d9d1> in <module>()
      1 vector_map = VectorMap(128)
----> 2 vector_map.load("/home/noname/Documents/data/vectors")

/home/noname/spacy/src/sense2vec/sense2vec/vectors.pyx in sense2vec.vectors.VectorMap.load (sense2vec/vectors.cpp:4870)()
    100 
    101     def load(self, data_dir):
--> 102         self.data.load(path.join(data_dir, 'vectors.bin'))
    103         with open(path.join(data_dir, 'strings.json')) as file_:
    104             self.strings.load(file_)

/home/noname/spacy/src/sense2vec/sense2vec/vectors.pyx in sense2vec.vectors.VectorStore.load (sense2vec/vectors.cpp:7049)()
    200         cdef float[:] cv
    201         for i in range(nr_vector):
--> 202             cfile.read_into(&tmp[0], self.nr_dim, sizeof(tmp[0]))
    203             ptr = &tmp[0]
    204             cv = <float[:128]>ptr

/home/noname/.linuxbrew/Cellar/python/2.7.11/lib/python2.7/site-packages/spacy/cfile.pyx in spacy.cfile.CFile.read_into (spacy/cfile.cpp:1147)()
     25         st = fread(dest, elem_size, number, self.fp)
     26         if st != number:
---> 27             raise IOError
     28 
     29     cdef int write_from(self, void* src, size_t number, size_t elem_size) except -1:

IOError:

Also, how do I get the relevant freqs.json and strings.json for the trained vectors? For strings.json, I have the batch outputs from merge_text.py, so they need to be mapped to the relevant information in freqs.json. If there is already a function that does this and I missed calling it, please let me know.

Python version: 2.7.11
Spacy version: 0.100.5

Error while using most_similar

s2v.most_similar(query_vector, 3)[0]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "sense2vec/vectors.pyx", line 154, in sense2vec.vectors.VectorMap.most_similar (sense2vec/vectors.cpp:4847)
  File "spacy/strings.pyx", line 104, in spacy.strings.StringStore.__getitem__ (spacy/strings.cpp:2339)
IndexError: 2065123256

error: '__pyx_v_ptr' declared as a pointer

Running pip install -e . fails on Mac OS X 10.11:

Installing collected packages: sense2vec
  Running setup.py develop for sense2vec
    Complete output from command /usr/local/opt/python/bin/python2.7 -c "import setuptools, tokenize;__file__='/demo/python/spacySense2vec/setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" develop --no-deps:
    sense2vec/vectors.pyx has not changed
    Cythonizing sources
    running develop
    running egg_info
    writing requirements to sense2vec.egg-info/requires.txt
    writing sense2vec.egg-info/PKG-INFO
    writing top-level names to sense2vec.egg-info/top_level.txt
    writing dependency_links to sense2vec.egg-info/dependency_links.txt
    warning: manifest_maker: standard file '-c' not found
    
    reading manifest file 'sense2vec.egg-info/SOURCES.txt'
    reading manifest template 'MANIFEST.in'
    writing manifest file 'sense2vec.egg-info/SOURCES.txt'
    running build_ext
    building 'sense2vec.vectors' extension
    creating build
    creating build/temp.macosx-10.11-x86_64-2.7
    creating build/temp.macosx-10.11-x86_64-2.7/sense2vec
    clang -fno-strict-aliasing -fno-common -dynamic -g -O2 -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -I/usr/local/Cellar/python/2.7.12_2/Frameworks/Python.framework/Versions/2.7/include/python2.7 -I/demo/python/spacySense2vec/include -I/usr/local/Cellar/python/2.7.12_2/Frameworks/Python.framework/Versions/2.7/include/python2.7 -c sense2vec/vectors.cpp -o build/temp.macosx-10.11-x86_64-2.7/sense2vec/vectors.o -O3 -Wno-unused-function -fno-stack-protector
    In file included from sense2vec/vectors.cpp:325:
    In file included from /demo/python/spacySense2vec/include/numpy/arrayobject.h:15:
    In file included from /demo/python/spacySense2vec/include/numpy/ndarrayobject.h:17:
    In file included from /demo/python/spacySense2vec/include/numpy/ndarraytypes.h:1728:
    /demo/python/spacySense2vec/include/numpy/npy_deprecated_api.h:11:2: warning: "Using deprecated NumPy API, disable it by #defining NPY_NO_DEPRECATED_API NPY_1_7_API_VERSION" [-W#warnings]
    #warning "Using deprecated NumPy API, disable it by #defining NPY_NO_DEPRECATED_API NPY_1_7_API_VERSION"
     ^
    sense2vec/vectors.cpp:7423:10: error: '__pyx_v_ptr' declared as a pointer to a reference of type 'float &'
      float &*__pyx_v_ptr;
             ^
    1 warning and 1 error generated.
    error: command 'clang' failed with exit status 1
    
    ----------------------------------------
Command "/usr/local/opt/python/bin/python2.7 -c "import setuptools, tokenize;__file__='/demo/python/spacySense2vec/setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" develop --no-deps" failed with error code 1 in /demo/python/spacySense2vec/

Segmentation fault with merge_text.py

I'm trying to train sense2vec on Wikipedia.

After python -m spacy.en.download, when running sense2vec/bin/merge_text.py -b ~/input_dir ~/output_dir, where input_dir is a directory of cleaned Wikipedia articles in plaintext files, I get a segmentation fault in the worker threads and 4 empty text files in output_dir.

Any ideas? Would it be possible to get a pretrained, non-Reddit model?
