Comments (4)

VHRanger avatar VHRanger commented on August 18, 2024

How are you saving and loading this bin file?

What methods and code did you use to be exact?

Looking at your bin file, there are spaces in the node names, but the names aren't delimited by quotes, so they break the pseudo-CSV format that embedding model files normally use: the separator between the numbers on a line is the same character as the spaces in "The New York Times".
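To make the failure concrete, here is a minimal sketch of how a naive space-separated reader would interpret such a line. The `parse_line` helper and the 3-value sample vectors are made up for illustration; they are not taken from nodevectors or from the linked file.

```python
def parse_line(line):
    """Naive reader: the first field is the name, the rest are floats."""
    fields = line.split()
    name, values = fields[0], fields[1:]
    return name, [float(v) for v in values]


# A single-word name parses fine.
print(parse_line("Germany 0.5 -1.25 2.0"))

# A multi-word name does not: "New" is treated as the first vector value.
try:
    parse_line("The New York Times 0.5 -1.25 2.0")
except ValueError as err:
    print("parse failed:", err)
```

Any reader that splits on the same character used inside the names has no way to tell where the name ends, which is why a different separator inside names is needed.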

Ideally you'd put underscores in multi-word entities (which is what word2vec-googlenews does), or use some separator other than the space that already separates the numbers in the vector.

You could just fix the trained model file you linked with a quick Python or shell script that replaces the spaces in each line that come before the numbers with some other character.

from nodevectors.

VHRanger avatar VHRanger commented on August 18, 2024

Hey,

I don't see your notebook attached, but I'd be greatly interested in what you're doing. I imagine you're building a knowledge graph embedding from Wikipedia articles and the links between them?

Not completely sure how to access the bin file in order to modify my node names.

Your bin file is actually just a text file. The first line holds the number of graph nodes and the embedding size, and each line after that is "node_name vec1 vec2 vec3", where node_name is the name of your concept (United States and World War II are among the first ones) and vecN is the Nth embedding value.

Doing head wheel_model.bin on your file looks like this:

4806237 32
United States 6.8033214 -4.154616 0.012066513 11.332104 -16.76169 -19.428492 -10.781682 1.6716479 -0.19667558 1.3256662 -3.5415244 2.237211 -13.055762 5.908908 -2.7512574 14.582257 -0.12124324 -10.849494 -16.312693 0.2916756 -0.026202707 -1.9240215 12.621503 5.048701 -4.752299 2.0447419 0.21070565 -3.0716078 -1.6428103 10.187764 10.904518 14.997075
United Kingdom -3.7103581 -9.126546 1.4116981 -0.056045603 -2.0448332 -1.1038564 -0.86103773 -6.0579333 -2.976625 12.728374 -5.2228265 -4.3223863 -5.1425834 5.964745 -8.878074 -13.255045 -3.061953 8.424931 -4.71807 -2.1219532 -19.40228 -11.125764 4.3466306 11.592513 -20.929165 -6.405772 -2.067156 20.383396 -0.61012983 -6.948416 8.447513 -8.711377
World War II -2.8781173 4.932038 -16.562336 -1.0972513 10.062222 -1.1481029 -13.783848 -0.47825798 -2.9046717 3.2946844 -4.153537 0.7279581 1.5258105 -0.54257464 -6.9199524 6.8763733 -20.364853 9.290325 3.9864638 -4.5167356 -0.2601528 3.492558 13.922298 -4.532118 -4.7888575 -21.872889 1.8821391 -2.4022622 9.867455 4.495968 29.433992 -6.4929194
Germany 0.6580372 -6.628668 -0.43459854 -1.6681336 -1.8274205 -6.296602 -9.535266 -7.501447 1.0744739 2.68418 21.059107 -8.0889635 10.379657 -9.315827 -2.6443145 -14.056111 -4.5304785 8.186242 2.2545314 11.87788 1.93783 -13.7836075 4.9945726 -2.8565195 10.113838 -22.338263 13.395137 -13.977517 5.5553727 -0.88845044 10.984225 -5.724566

This means an easy way to modify it would be line iteration in Python. Here's a sketch of a quick script I would write, using the fact that your vectors have 32 dimensions to split each line:

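# Assumes the file is UTF-8 text; adjust the encoding if yours differs.
with open('wheel_model.bin', 'r', encoding='utf-8') as fin, \
        open('new_model.bin', 'w', encoding='utf-8') as fout:
    # Copy the header line ("<node count> <embedding size>") unchanged.
    fout.write(fin.readline())
    for line in fin:
        words = line.rstrip('\n').split(' ')
        vector = words[-32:]   # the last 32 fields are the embedding values
        concept = words[:-32]  # everything before them is the node name
        # Join the name with underscores and keep spaces between the numbers.
        new_line = '_'.join(concept) + ' ' + ' '.join(vector) + '\n'
        fout.write(new_line)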

There might be bugs in this script; it's just a quick sketch of what I would do. But it would "fix" your currently broken file.
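One way to confirm the rewritten file is well-formed is to check that every row has exactly one name field plus the number of values declared in the header. The sketch below does that in plain Python; `check_embedding_file` is a hypothetical helper, the sample rows are invented, and UTF-8 is assumed.

```python
import os
import tempfile


def check_embedding_file(path, encoding="utf-8"):
    """Return (declared_count, rows_seen, bad_line_numbers) for a text embedding file."""
    bad = []
    rows = 0
    with open(path, "r", encoding=encoding) as f:
        # Header line: "<node count> <embedding size>".
        n_nodes, dim = (int(x) for x in f.readline().split())
        for line_no, line in enumerate(f, start=2):
            fields = line.rstrip("\n").split(" ")
            rows += 1
            # A clean row is one name field followed by `dim` numbers.
            if len(fields) != dim + 1:
                bad.append(line_no)
    return n_nodes, rows, bad


# Tiny made-up file: the second row still has a space in its name.
sample = "2 3\nUnited_States 0.1 0.2 0.3\nWorld War_II 0.4 0.5 0.6\n"
with tempfile.NamedTemporaryFile("w", suffix=".bin", delete=False,
                                 encoding="utf-8") as tmp:
    tmp.write(sample)
result = check_embedding_file(tmp.name)
os.remove(tmp.name)
print(result)  # (2, 2, [3])
```

On a correctly fixed file, the first two numbers should match and the list of bad line numbers should be empty.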

Just to confirm, since my graph is directed, this limits the Nodevectors walks, correct? I'm sort of assuming I want to use a directed graph to try to embed Wikipedia articles.

Correct, on directed graphs the random walks will only take steps in the direction of the edges. This is true whether you used NetworkX or the CSRGraph backend to load the graph.
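As a rough illustration of what that means, here is a toy random walk over a directed adjacency list in plain Python. This is not the nodevectors implementation; the `random_walk` helper and the three-node graph are hypothetical, and stopping at a node with no outgoing edges is a simplification chosen here, not a claim about how the library handles dead ends.

```python
import random


def random_walk(adjacency, start, walk_length, rng):
    """Walk along outgoing edges only; stop early at a node with no successors."""
    walk = [start]
    while len(walk) < walk_length:
        successors = adjacency.get(walk[-1], [])
        if not successors:
            break
        walk.append(rng.choice(successors))
    return walk


# Directed toy graph: A -> B -> C, with no edges leading back.
adjacency = {"A": ["B"], "B": ["C"], "C": []}
rng = random.Random(0)
print(random_walk(adjacency, "A", 5, rng))  # ['A', 'B', 'C']
print(random_walk(adjacency, "C", 5, rng))  # ['C']
```

A walk started at A can reach C, but a walk started at C never sees A or B, so nodes with few outgoing links contribute little context of their own.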

On my first run, I just went with default settings. Any suggestions you might have on that?

Depends which model you're using.

If you're using Node2Vec, you should play with walk length and window size especially. Longer walk lengths and larger windows train slower but create "deeper" embeddings. Changing return_weight and neighbor_weight makes training drastically slower on large graphs and doesn't gain much performance, so I don't recommend it.
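For a rough feel of why longer walks and larger windows cost more, the sketch below counts the (center, context) pairs that a single walk yields under a fixed symmetric window. It assumes a skip-gram style trainer; real trainers such as word2vec typically shrink the window at random and subsample, so treat the numbers as an upper bound. `count_context_pairs` is a hypothetical helper, not part of nodevectors.

```python
def count_context_pairs(walk_length, window):
    """Number of (center, context) pairs in one walk with a fixed symmetric window."""
    total = 0
    for i in range(walk_length):
        left = min(i, window)                     # context nodes before position i
        right = min(walk_length - 1 - i, window)  # context nodes after position i
        total += left + right
    return total


for walk_length, window in [(10, 5), (30, 5), (30, 10)]:
    print(walk_length, window, count_context_pairs(walk_length, window))
```

With these inputs the counts are 70, 270, and 490 pairs per walk, and that cost is paid again for every walk in every epoch.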

You can also try other algorithms. GGVec (which is my creation) can be tried with order = 1 (faster, cruder) and order = 2 (much slower and much deeper), with the other parameters set as recommended in the README. Another good one for large graphs is ProNE, and its hyperparameters don't change the results much.

MrPaulAlbert avatar MrPaulAlbert commented on August 18, 2024

Fixed bin file to avoid spaces in node names. Working great!

Congrats on this package. Both Stanford SNAP C++ and Python Node2Vec choked on this dataset after running for days. Nodevectors successfully completed the task in 18 hours.

VHRanger avatar VHRanger commented on August 18, 2024

Good to hear!

I'll close the issue.
