cynricfu / MAGNN

367 stars · 3 watchers · 66 forks · 3.32 MB

Metapath Aggregated Graph Neural Network for Heterogeneous Graph Embedding

Python 67.97% Jupyter Notebook 32.03%
graph-neural-network network-embedding heterogeneous-network heterogeneous-graph heterogeneous-graph-neural-network

MAGNN's People

Contributors

cynricfu


MAGNN's Issues

OOM kill

Thanks for sharing the code! When I use other nodes in the DBLP dataset to search for metapath instances during preprocessing, following operations similar to those in preprocess_DBLP.ipynb, it consistently requires a very large amount of memory. Does the author know the reason for this?

A question about the etypes list in link prediction

I have a question about the etypes list in run_LastFM.py.

etypes_lists = [[[0, 1], [0, 2, 3, 1], [None]],
                [[1, 0], [2, 3], [1, None, 0]]]

expected_metapaths = [
    [(0, 1, 0), (0, 1, 2, 1, 0), (0, 0)],
    [(1, 0, 1), (1, 2, 1), (1, 0, 0, 1)]
]

Why is the etype of (0, 0) None? Is it because links between nodes of the same type are not important? If the two end nodes of the links I want to predict have the same node type, should I assign a type to (0, 0)?
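For anyone reading along, the two lists line up one-to-one: a metapath over k node types traverses k - 1 edges, so each etypes entry has one element per edge. A minimal sanity-check sketch (my own illustration inferred from the lists above, not code from the repository):

# Each metapath with k node types traverses k - 1 edges, so its etype
# list has k - 1 entries; the single edge of (0, 0) gets the placeholder None.
user_metapaths = [(0, 1, 0), (0, 1, 2, 1, 0), (0, 0)]
user_etypes = [[0, 1], [0, 2, 3, 1], [None]]
for mp, et in zip(user_metapaths, user_etypes):
    assert len(et) == len(mp) - 1
    print(mp, '->', et)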

About Data Preprocessing

Hi, I have a question about the preprocessing in DBLP.

As I can see in preprocess_DBLP.ipynb, you use data/raw/DBLP/DBLP4057_GAT_with_idx.mat from HAN.

Where can I find that file? And how can I make sure that the indices of that matrix correspond to your data?

DBLP preprocessing shows an out-of-memory error?

Thank you very much for sharing the code.
I have a question about the preprocessing of DBLP. I used preprocess_DBLP.ipynb to generate the DBLP_processed files, but it took a long time and raised an out-of-memory error while processing the data. The laptop I use has 16 GB of RAM.
I wonder whether my laptop's RAM is too small, or whether I made some other mistake in processing the data. How much RAM did you use to process the DBLP dataset?
Looking forward to your reply. Thank you.

Link prediction

In link prediction experiments, should parameters be shared in MAGNN_ctr_ntype_specific if the target nodes are of the same type?

Metapath instances sampling

Hi, I need to use your sampling code in preprocess.py, and I would like to know how long it takes to sample all the metapath instances for the DBLP dataset. I have been running it on an adjM for a long time, so I'm a little worried that I did something wrong.

Train - validation - test splits

Hi, can you share your train-validation-test splits? Since the sklearn version used for preprocessing is not mentioned, my PRNG may give a different split.
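For reference, a minimal reproducibility sketch; the sizes, ratios, and random_state below are placeholders for illustration, not the repository's actual settings:

import numpy as np
from sklearn.model_selection import train_test_split

# Pinning random_state makes the split deterministic for a fixed
# scikit-learn version; 4057 is the number of DBLP authors.
idx = np.arange(4057)
train_val_idx, test_idx = train_test_split(idx, test_size=0.2, random_state=42)
train_idx, val_idx = train_test_split(train_val_idx, test_size=0.25, random_state=42)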

MemoryError

Hi, what is the reason for the memory error while debugging the LastFM preprocessing code?

How to preprocess the DBLP dataset?

Thank you for your work. I am building a system using MAGNN, so I would like to know how to preprocess the DBLP XML file into multiple matrices, including features.

Discussion of intermediate nodes

The paper clearly states that the information of the intermediate nodes along a metapath is taken into account, but I really do not see how you actually handle the intermediate nodes.
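For context, the paper's intra-metapath aggregation encodes every node along a metapath instance, not just the two endpoints; the simplest of the instance encoders it discusses is the mean encoder. A minimal sketch with assumed tensor shapes:

import torch

# One APCPA instance has 5 nodes; stack their (type-transformed) feature
# vectors and encode the whole instance, intermediate nodes included.
instance_feats = torch.randn(5, 64)      # all node features on the instance
h_instance = instance_feats.mean(dim=0)  # mean encoder over the instance
print(h_instance.shape)                  # torch.Size([64])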

Improvement Idea using Pretrained BERT word embedding model

Hi, I have recently been studying graph neural networks and was looking for some SOTA implementations using GNNs.

From Papers with Code I got to know your work and was really impressed by your solutions to the problems in the previous literature.

Experimenting on my own with your IMDB code, I came up with the idea of using a pre-trained embedding model to improve prediction performance.

So I used a pre-trained DistilBERT model to build the node features instead of count vectors; a sketch of this step follows, and below it are the results I got.
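A minimal sketch of that step; the model variant, pooling choice, and text source are my assumptions, not the poster's actual code:

import torch
from transformers import DistilBertModel, DistilBertTokenizer

tok = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
bert = DistilBertModel.from_pretrained('distilbert-base-uncased')

def node_feature(text):
    # Encode one node's text and take the first-token embedding as its
    # feature vector, replacing the count-vector features.
    batch = tok(text, truncation=True, return_tensors='pt')
    with torch.no_grad():
        out = bert(**batch)
    return out.last_hidden_state[:, 0].squeeze(0)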

[DistilBERT]

SVM tests summary
Macro-F1: 0.640466 ~ 0.014346 (0.8), 0.642629 ~ 0.007654 (0.6), 0.638584 ~ 0.005989 (0.4), 0.632471 ~ 0.004478 (0.2)
Micro-F1: 0.640230 ~ 0.014154 (0.8), 0.642098 ~ 0.007682 (0.6), 0.637949 ~ 0.006281 (0.4), 0.631836 ~ 0.004786 (0.2)
K-means tests summary
NMI: 0.174531 ~ 0.000000
ARI: 0.162560 ~ 0.000000

and below is the original result I got when I ran the code.

[original result]

SVM tests summary
Macro-F1: 0.601211 ~ 0.016628 (0.8), 0.602220 ~ 0.007302 (0.6), 0.598148 ~ 0.010544 (0.4), 0.591385 ~ 0.004841 (0.2)
Micro-F1: 0.600718 ~ 0.016712 (0.8), 0.602011 ~ 0.007650 (0.6), 0.598323 ~ 0.010250 (0.4), 0.591484 ~ 0.004161 (0.2)
K-means tests summary
NMI: 0.151366 ~ 0.000000
ARI: 0.159534 ~ 0.000000

You might already know about this approach, but I just wanted to let you know!

If you are interested, I'll provide the full code for your reference.

Thanks again for this wonderful paper and the implementation codes.

Mattias

Why is the edge from dst to src in DGL graph?

Hi~ I have a question about constructing the DGL graph.

edges.append((row_parsed[0], dst))

According to the preprocessing code above, in parse_adjlist(adjlist, edge_metapath_indices, samples=None), the elements of edges should be (src, dst) pairs. But when constructing the DGL graph, it seems that you add each edge from dst to src.

g.add_edges(*list(zip(*[(edges[i][1], edges[i][0]) for i in sorted_index])))

I don't quite understand this. Could you explain it? Thank you very much!
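For what it's worth, a minimal DGL sketch of the usual reason for the flip (an illustration of DGL's message-passing convention, not the author's confirmed rationale): messages flow from source to destination, so reversing each parsed (src, dst) pair makes src the receiver during aggregation.

import dgl
import dgl.function as fn
import torch

# Edges point neighbor -> target: nodes 1 and 2 send messages to node 0.
g = dgl.graph(([1, 2], [0, 0]), num_nodes=3)
g.ndata['h'] = torch.eye(3)
g.update_all(fn.copy_u('h', 'm'), fn.sum('m', 'h_agg'))
print(g.ndata['h_agg'][0])  # tensor([0., 1., 1.]): sum of neighbors' features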

How to perform link prediction (user-artist pair on Last.fm) for homogeneous methods like LINE and GCN?

Hi, we appreciate your work.
I am confused about how to perform link prediction with homogeneous methods like LINE, node2vec, GCN, and GAT. They can only learn homogeneous graphs, i.e. embeddings for one type of object, but the user-artist links in Last.fm are heterogeneous: they connect two different types of objects. Likewise, HAN is only able to learn embeddings for one type of target object.
Thanks.

Could you publish the code of the baseline?

Hi,
I am doing some tests on some baselines, such as GAT and GCN, on the preprocessed DBLP dataset.
I am using metapath APCPA to construct a metapath-based homogeneous graph and conduct a GAT layer on it with the hyperparameter settings reported in MAGNN.
But I got a very different result:

SVM test
Macro-F1: 0.938165 ~ 0.004259 (0.8), 0.937600 ~ 0.004730 (0.6), 0.937070 ~ 0.002117 (0.4), 0.933790 ~ 0.003152 (0.2)
Micro-F1: 0.942638 ~ 0.004183 (0.8), 0.942210 ~ 0.004119 (0.6), 0.941688 ~ 0.002020 (0.4), 0.938910 ~ 0.003140 (0.2)
K-means test
NMI: 0.778665 ~ 0.000000
ARI: 0.837312 ~ 0.000000

I think this is because I somehow introduced extra information that should not be involved, which inflates the result, so I wonder whether the code for the baselines is available.
I would really appreciate it if you could share the baseline code! 👍 :)
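In case it helps others reproducing this setup, here is one plausible way to build the APCPA metapath-based homogeneous graph from the raw adjacency matrices (the variable names, toy shapes, and binarization step are my assumptions, not the MAGNN baseline code):

import numpy as np
import scipy.sparse as sp

# Toy stand-ins for the author-paper and paper-conference adjacencies.
A_ap = sp.random(100, 300, density=0.02, format='csr', random_state=0)
A_pc = sp.random(300, 20, density=0.05, format='csr', random_state=1)

apc = A_ap @ A_pc                        # author -> conference via A-P-C
A_apcpa = apc @ apc.T                    # author -> author via A-P-C-P-A
A_apcpa.setdiag(0)                       # drop self-loops
A_apcpa = (A_apcpa > 0).astype(np.int8)  # binarize to an adjacency matrix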

What does etypes_list mean?

When I run run_DBLP.py, I am confused about what etypes_list = [[0, 1], [0, 2, 3, 1], [0, 4, 5, 1]] means.
I have seen the closed question #5, but I also want to know where in the code the edge type indices are defined.

About the graph embedding baselines

Thanks for sharing the code! I am very interested in your experimental setup. I ran into some problems implementing link prediction for the HERec method, so I was wondering if you could share your data preprocessing and model code for this task. Looking forward to your reply and tips!

Mask the positive train edges

mask = [False if [u1, a1 - offset] in exclude or [u2, a2 - offset] in exclude else True for u1, a1, u2, a2 in indices[:, [0, 1, -1, -2]]]

Why exclude the positive train edges in the function parse_minibatch_LastFM?
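One plausible reading (an inference, not the author's confirmed answer): when a positive training pair is the batch's prediction target, any metapath instance whose endpoint edge is exactly that pair would leak the label, so such instances are masked out. A toy miniature of the logic with hypothetical values:

import numpy as np

exclude = [[0, 5], [3, 7]]         # positive (user, artist) training pairs
indices = np.array([[0, 5, 2, 9],  # this instance starts with the edge (0, 5)
                    [1, 6, 4, 8]])
mask = [[u1, a1] not in exclude and [u2, a2] not in exclude
        for u1, a1, u2, a2 in indices[:, [0, 1, -1, -2]]]
print(mask)  # [False, True]: the first instance is excluded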

Refactoring the code

Hi, I wish to refactor the MAGNN code for my research project.
However, I found that the link prediction and node classification tasks use different models (MAGNN_lp, MAGNN_nc, MAGNN_nc_mb) in your implementation.
So if I wish to produce a generic MAGNN model in an unsupervised setting that supports multi-layer, mini-batch training over more than one type of edge (instead of just user-artist in LastFM), how could I do that based on your implementation? Could you please give me some directions?
More specifically, why did you set use_minibatch=False in MAGNN_nc? What would happen if I simply changed it to True?
Thanks!

Run link prediction on the LastFM dataset

How long does it take for python run_LastFM.py to run? And how do you set the parameters (are you using the default values, such as batch_size = 8, which seems very small)?

LastFM dataset preprocessing error

When I ran preprocess_LastFM.ipynb to process the dataset for link prediction, something went wrong.

[screenshot of the error omitted]

I couldn't find how to produce 'data/preprocessed/LastFM_processed/user_artist.npy' in the code. Please help me with this problem.

what is the meaning of 'use_masks' and 'no_masks'?

Thanks for open-sourcing the code.

I'm confused about the code in run_LastFM.py

MAGNN/run_LastFM.py, lines 21 to 23 at commit b8557f5:

use_masks = [[True, True, False],
             [True, False, True]]
no_masks = [[False] * 3, [False] * 3]

what is the meaning of 'use_masks' and 'no_masks'?

They are parameters of parse_adjlist_LastFM:

def parse_adjlist_LastFM(adjlist, edge_metapath_indices, samples=None, exclude=None, offset=None, mode=None):

Looking forward to your reply.
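One consistent reading, inferred from the metapath lists in run_LastFM.py rather than quoted from the author: a True entry marks a metapath that itself traverses a user-artist edge, so its instances must be masked during training to avoid leaking the very links being predicted, while validation and testing use no_masks because no training labels can leak there.

# Inferred correspondence (U = user, A = artist, T = tag):
expected_metapaths = [
    [(0, 1, 0), (0, 1, 2, 1, 0), (0, 0)],  # U-A-U, U-A-T-A-U, U-U
    [(1, 0, 1), (1, 2, 1), (1, 0, 0, 1)],  # A-U-A, A-T-A, A-U-U-A
]
use_masks = [[True, True, False],  # U-A-U and U-A-T-A-U contain a U-A edge; U-U does not
             [True, False, True]]  # A-U-A and A-U-U-A contain a U-A edge; A-T-A does not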

Question about data splitting

Hello, I have read your WWW '20 paper and your code recently. It's interesting work.
I currently have one question about the data split process in your paper.

You mentioned how you split the data for semi-supervised learning models.
[screenshot of the paper's data-split description omitted]
Q1: Why is there another "Train %" column in Table 3? Do you not follow the above-mentioned data split? Or do you use the above-mentioned semi-supervised setting to get the embeddings, and then use the training-rate settings in Table 3 to train another classifier? But in that case, could we still call it semi-supervised learning?
Q2: How do you run the GNN baselines, e.g., GCN and GAT? In Table 3 they also correspond to different training rates, which is confusing.

etype list

etypes_list = [[0, 1], [0, 2, 3, 1], [0, 4, 5, 1]]

Can you elaborate on what etypes_list signifies, and how you arrived at the list above? Thanks.
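For readers hitting the same question, one reading consistent with the DBLP metapaths APA, APTPA, and APCPA used in the paper (the exact integer-to-edge-type assignment below is an inference, not quoted from the repository):

# A = author, P = paper, T = term, C = conference; each integer is the
# type of one edge traversed along the metapath.
etypes_list = [
    [0, 1],        # APA:   A-P (0), P-A (1)
    [0, 2, 3, 1],  # APTPA: A-P (0), P-T (2), T-P (3), P-A (1)
    [0, 4, 5, 1],  # APCPA: A-P (0), P-C (4), C-P (5), P-A (1)
]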

Having trouble understanding the data format

Hi, I tried to dive into the source code of the load_DBLP_data() function and printed the intermediate variables. However, I still have trouble understanding the data format. Could you give some explanation? Thank you!

How to deal with nodes with no neighbor for a given metapath?

I hit this error when running on my own dataset:

Traceback (most recent call last):
  File "run_mydata.py", line 223, in <module>
    args.epoch, args.patience, args.batch_size, args.samples, args.repeat, args.save_postfix)
  File "run_mydata.py", line 96, in run_model_DBLP
    adjlists, edge_metapath_indices_list, train_idx_batch, device, neighbor_samples)
  File "/MAGNN-master/utils/tools.py", line 112, in parse_minibatch
    [adjlist[i] for i in idx_batch], [indices[i] for i in idx_batch], samples)
  File "/MAGNN-master/utils/tools.py", line 72, in parse_adjlist
    num = len(edge_metapath_indices[0][0])
IndexError: index 0 is out of bounds for axis 0 with size 0

I think it may be because the code assumes that every node has at least one neighbor along every metapath. Could you give me some advice on how to fix this?
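A possible workaround (hypothetical, not the author's fix): before parsing a batch, pad any node that has no instance for some metapath with a degenerate self-instance, so edge_metapath_indices is never empty.

import numpy as np

def pad_empty_instances(instances, node_id, metapath_len):
    # If a node has no metapath instances, fabricate one that repeats the
    # node itself, so downstream code like len(instances[0]) never fails.
    if len(instances) == 0:
        return np.full((1, metapath_len), node_id, dtype=np.int64)
    return instances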

Question about ID offsets

When processing the LastFM dataset, the user and artist IDs are repeatedly shifted by offsets during preprocessing and sampling. Why not encode users, artists, and the other nodes with a unified numbering, so that each node has a single unique ID throughout?

Loss - only supervised?

In run_DBLP.py, I see that the model is trained with only a supervised loss. Can you please confirm this, as the paper also mentions an unsupervised loss?

Last.fm training is too slow; how can I use a smaller Last.fm?

Training on Last.fm is very slow because of its size: one epoch takes about 2 hours, which makes it almost impossible for me to use grid search to find a better result.
It would help if I could use a smaller Last.fm, or a part of it. Is that possible?
I tried changing num_user = 1892, num_artist = 17632, num_tag = 11945 to smaller values in preprocess_LastFM.ipynb, but some errors occurred.

Can someone tell me whether there is an easy way to use a smaller Last.fm? Thanks a lot.
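One way to get a smaller Last.fm is to subsample the raw interaction file before running the notebook, rather than shrinking the counts afterwards. A rough sketch; the file name, separator, and column names follow the HetRec 2011 release and are assumptions about this repository's raw data:

import pandas as pd

# Keep a random subset of users, restrict rows to those users, then
# re-index both ID columns so they are contiguous again.
ua = pd.read_csv('data/raw/LastFM/user_artists.dat', sep='\t')
keep_users = ua['userID'].drop_duplicates().sample(n=500, random_state=0)
ua_small = ua[ua['userID'].isin(keep_users)].copy()
ua_small['userID'] = ua_small['userID'].astype('category').cat.codes
ua_small['artistID'] = ua_small['artistID'].astype('category').cat.codes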
