I'm getting the following error when trying to load <a href="http://nlp.stanford.edu/d


Error loading nlp.stanford.edu vectors about glove-python HOT 4 OPEN

maciejkula commented on July 27, 2024

Error loading nlp.stanford.edu vectors

from glove-python.

Comments (4)

bjornamr commented on July 27, 2024

Did anyone solve this?

bjornamr commented on July 27, 2024

@thomasj02 I solved it. This is not a clean solution, but it worked for me on Python 3.
It uses a lot of RAM, so I advise against running anything else RAM-heavy at the same time.

    @classmethod
    def load_stanford(cls, filename):
        """
        Load model from the output files generated by
        the C code from http://nlp.stanford.edu/projects/glove/.

        The entries of the word dictionary will be of type
        unicode in Python 2 and str in Python 3.
        """

        dct = {}
        vectors = []
        vector_size = 0

        # Read in the data, keeping one list of floats per word.
        with io.open(filename, 'r', encoding='utf-8') as savefile:
            for i, line in enumerate(savefile):
                tokens = line.rstrip().split(' ')
                word = tokens[0]
                entries = tokens[1:]
                vector_size = len(entries)
                dct[word] = i
                vectors.append([float(x) for x in entries])

        # Infer word vector dimensions: rows are words,
        # columns are vector components.
        no_vectors = len(vectors)
        no_components = vector_size

        # Set up the model instance. The arrays are backed by the files
        # "word_vec" and "word_biases" in the working directory.
        instance = Glove()
        instance.no_components = no_components
        word_vec = np.memmap("word_vec", dtype=np.float32, mode="w+",
                             shape=(no_vectors, no_components))
        word_vec[:] = vectors
        instance.word_vectors = word_vec
        instance.word_biases = np.memmap("word_biases", dtype=np.float32,
                                         mode="w+", shape=(no_vectors,))

        instance.add_dictionary(dct)

        return instance
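Much of the RAM cost mentioned above likely comes from holding every row as a Python list of floats before copying it into the array. The following is a minimal sketch of a two-pass alternative, not part of glove-python: the helper name `load_glove_text` is hypothetical, and it assumes one word per line, space-separated values, and no spaces inside words.

```python
import io

import numpy as np


def load_glove_text(filename, dtype=np.float32):
    """Hypothetical helper: read a GloVe text file into (dct, vectors)."""
    # First pass: count non-blank rows and take the dimension
    # from the first non-blank line.
    no_vectors = 0
    no_components = 0
    with io.open(filename, 'r', encoding='utf-8') as savefile:
        for line in savefile:
            if not line.strip():
                continue
            if no_vectors == 0:
                no_components = len(line.rstrip().split(' ')) - 1
            no_vectors += 1

    # Second pass: fill a preallocated array one row at a time,
    # so only a single row exists as a Python list at any moment.
    dct = {}
    vectors = np.zeros((no_vectors, no_components), dtype=dtype)
    row = 0
    with io.open(filename, 'r', encoding='utf-8') as savefile:
        for line in savefile:
            if not line.strip():
                continue
            tokens = line.rstrip().split(' ')
            dct[tokens[0]] = row
            vectors[row] = [float(x) for x in tokens[1:]]
            row += 1
    return dct, vectors
```

The returned pair could then be assigned to `instance.word_vectors` and passed to `instance.add_dictionary`, as in the method above. Duplicate words are not handled here: a later duplicate simply takes over the dictionary entry.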


sp4ghet commented on July 27, 2024

It looks like there are some unknown-word entries in the original vectors file, so the total number of stored values differs from num_words * dimensions and the reshape fails.
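The mismatch can be reproduced in isolation. This is a small sketch with made-up data, assuming the cause is a word (here `<unk>`) that appears on more than one line: the dictionary keeps one entry per unique word, while the flat array keeps the values of every line.

```python
import array

import numpy as np

# Made-up sample in the GloVe text format; "<unk>" appears twice.
lines = ["the 0.5 1.0", "<unk> 0.0 0.25", "cat -1.0 2.0", "<unk> 3.0 4.0"]

dct = {}
vectors = array.array('d')
for i, line in enumerate(lines):
    tokens = line.split(' ')
    dct[tokens[0]] = i  # the second "<unk>" overwrites the first entry
    vectors.extend(float(x) for x in tokens[1:])

no_components = 2
no_vectors = len(dct)  # 3 unique words, but 4 rows of values were stored

try:
    np.array(vectors).reshape(no_vectors, no_components)
    reshape_failed = False
except ValueError:
    # 8 stored values cannot be arranged as 3 rows of 2.
    reshape_failed = True

print(len(vectors), no_vectors * no_components, reshape_failed)
```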

Adding a small check for <unk> fixed it for me with the Twitter corpus.
This is the glove.py file for the Glove class.
https://github.com/maciejkula/glove-python/blob/master/glove/glove.py#L235

    @classmethod
    def load_stanford(cls, filename):
        """
        Load model from the output files generated by
        the C code from http://nlp.stanford.edu/projects/glove/.

        The entries of the word dictionary will be of type
        unicode in Python 2 and str in Python 3.
        """

        dct = {}
        vectors = array.array('d')

        # Read in the data.
        with io.open(filename, 'r', encoding='utf-8') as savefile:
            for i, line in enumerate(savefile):
                tokens = line.split(' ')

                word = tokens[0]
                entries = tokens[1:]
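                # Added: skip <unk> so the row count matches len(dct).
                if word == '<unk>':
                    continue
                # Use the running count rather than the line index, so a
                # skipped line leaves no gap in the row indices.
                dct[word] = len(dct)
                vectors.extend([float(x) for x in entries])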

        # Infer word vectors dimensions.
        no_components = len(entries)
        no_vectors = len(dct)

        # Set up the model instance.
        instance = Glove()
        instance.no_components = no_components
        instance.word_vectors = (np.array(vectors)
                                 .reshape(no_vectors,
                                          no_components))
        instance.word_biases = np.zeros(no_vectors)
        instance.add_dictionary(dct)

        return instance


sp4ghet commented on July 27, 2024

Actually, on second thought, it should probably be less hardcoded and look more like

    if word in dct:
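As a rough illustration of that idea, here is a self-contained sketch that skips any word already seen instead of checking for one specific token. The helper name `parse_glove_lines` is hypothetical, not part of glove-python, and keeping the first occurrence of a duplicate is an assumption. It also assumes words contain no spaces.

```python
import numpy as np


def parse_glove_lines(lines):
    """Hypothetical helper: build (dct, vectors) from GloVe-format lines,
    keeping only the first occurrence of each word."""
    dct = {}
    rows = []
    for line in lines:
        tokens = line.rstrip().split(' ')
        word, entries = tokens[0], tokens[1:]
        if word in dct:
            # Duplicate word (for example a repeated <unk>): skip it so
            # the number of rows always equals len(dct).
            continue
        dct[word] = len(rows)
        rows.append([float(x) for x in entries])
    vectors = np.array(rows, dtype=np.float64)
    return dct, vectors


sample = ["the 0.5 1.0", "<unk> 0.0 0.25", "cat -1.0 2.0", "<unk> 3.0 4.0"]
dct, vectors = parse_glove_lines(sample)
print(dct)            # {'the': 0, '<unk>': 1, 'cat': 2}
print(vectors.shape)  # (3, 2)
```

Because the row index comes from `len(rows)` rather than the line number, the indices stay contiguous even when lines are skipped.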
