cynricfu / MAGNN

367 stars · 3 watchers · 66 forks · 3.32 MB

Metapath Aggregated Graph Neural Network for Heterogeneous Graph Embedding

Python 67.97% Jupyter Notebook 32.03%
graph-neural-network network-embedding heterogeneous-network heterogeneous-graph heterogeneous-graph-neural-network

MAGNN's People

Contributors

cynricfu


MAGNN's Issues

OOM kill

Thanks for sharing the code! When I use other nodes in the DBLP dataset to search for metapath instances during preprocessing, following operations similar to those in preprocess_DBLP.ipynb, it consistently requires a very large amount of memory. Does the author know the reason for this?

A question about the etypes list in link prediction

I have a question about the etypes list in run_LastFM.py.

etypes_lists = [[[0, 1], [0, 2, 3, 1], [None]],
                [[1, 0], [2, 3], [1, None, 0]]]

expected_metapaths = [
    [(0, 1, 0), (0, 1, 2, 1, 0), (0, 0)],
    [(1, 0, 1), (1, 2, 1), (1, 0, 0, 1)]
]

Why is the etype of (0, 0) None? Is it because links between nodes of the same type are not important? If the two end nodes of the links I want to predict have the same node type, should I assign a type to (0, 0)?
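For anyone reading along, the two lists line up one-to-one: a metapath over k node types traverses k - 1 edges, so each etypes entry has one element per edge. A minimal sanity-check sketch (my own illustration inferred from the lists above, not code from the repository):

# Each metapath with k node types traverses k - 1 edges, so its etype
# list has k - 1 entries; the single edge of (0, 0) gets the placeholder None.
user_metapaths = [(0, 1, 0), (0, 1, 2, 1, 0), (0, 0)]
user_etypes = [[0, 1], [0, 2, 3, 1], [None]]
for mp, et in zip(user_metapaths, user_etypes):
    assert len(et) == len(mp) - 1
    print(mp, '->', et)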

About Data Preprocessing

Hi, I have a question about the preprocessing in DBLP.

As I can see in preprocess_DBLP.ipynb, you use data/raw/DBLP/DBLP4057_GAT_with_idx.mat from HAN.

Where can I find that file? And how can I make sure that the indices of that matrix correspond to your data?

DBLP preprocessing shows an out-of-memory error?

Thank you very much for sharing the code.
I have a question about the preprocessing of DBLP. I used preprocess_DBLP.ipynb to generate the DBLP_processed files, but it took a long time and raised an out-of-memory error while processing the data. The laptop I use has 16 GB of RAM.
I wonder whether my laptop's RAM is too small, or whether I made some other mistake in processing the data. How much RAM did you use to process the DBLP dataset?
Looking forward to your reply. Thank you.

Link prediction

In link prediction experiments, should parameters be shared in MAGNN_ctr_ntype_specific if the target nodes are of the same type?

Metapath instances sampling

Hi, I need to use your sampling code in preprocess.py, and I would like to know how long it takes to sample all the metapath instances for the DBLP dataset. I have been running it on an adjM for a long time, so I'm a little worried that I did something wrong.

Train - validation - test splits

Hi, can you share your train-validation-test splits? Since the sklearn version used for preprocessing is not mentioned, my PRNG may give a different split.
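For reference, a minimal reproducibility sketch; the sizes, ratios, and random_state below are placeholders for illustration, not the repository's actual settings:

import numpy as np
from sklearn.model_selection import train_test_split

# Pinning random_state makes the split deterministic for a fixed
# scikit-learn version; 4057 is the number of DBLP authors.
idx = np.arange(4057)
train_val_idx, test_idx = train_test_split(idx, test_size=0.2, random_state=42)
train_idx, val_idx = train_test_split(train_val_idx, test_size=0.25, random_state=42)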

MemoryError

Hi, what is the reason for the memory error while debugging the LastFM preprocessing code?

How to preprocess the DBLP dataset?

Thank you for your work. I am building a system using MAGNN, so I would like to know how to preprocess the DBLP XML file into multiple matrices, including features.

Discussion of intermediate nodes

The paper clearly states that the information of the intermediate nodes along a metapath is taken into account, but I really do not see how you actually handle the intermediate nodes.
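For context, the paper's intra-metapath aggregation encodes every node along a metapath instance, not just the two endpoints; the simplest of the instance encoders it discusses is the mean encoder. A minimal sketch with assumed tensor shapes:

import torch

# One APCPA instance has 5 nodes; stack their (type-transformed) feature
# vectors and encode the whole instance, intermediate nodes included.
instance_feats = torch.randn(5, 64)      # all node features on the instance
h_instance = instance_feats.mean(dim=0)  # mean encoder over the instance
print(h_instance.shape)                  # torch.Size([64])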

Improvement Idea using Pretrained BERT word embedding model

Hi, I have recently been studying graph neural networks and was looking for some SOTA implementations using GNNs.

From Papers with Code I got to know your work and was really impressed by your solutions to the problems in the previous literature.

Experimenting on my own with your IMDB code, I came up with the idea of using a pre-trained embedding model to improve prediction performance.

So I used a pre-trained DistilBERT model to build the node features instead of count vectors; a sketch of this step follows, and below it are the results I got.
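A minimal sketch of that step; the model variant, pooling choice, and text source are my assumptions, not the poster's actual code:

import torch
from transformers import DistilBertModel, DistilBertTokenizer

tok = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
bert = DistilBertModel.from_pretrained('distilbert-base-uncased')

def node_feature(text):
    # Encode one node's text and take the first-token embedding as its
    # feature vector, replacing the count-vector features.
    batch = tok(text, truncation=True, return_tensors='pt')
    with torch.no_grad():
        out = bert(**batch)
    return out.last_hidden_state[:, 0].squeeze(0)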

[DistilBERT]

SVM tests summary
Macro-F1: 0.640466 ~ 0.014346 (0.8), 0.642629 ~ 0.007654 (0.6), 0.638584 ~ 0.005989 (0.4), 0.632471 ~ 0.004478 (0.2)
Micro-F1: 0.640230 ~ 0.014154 (0.8), 0.642098 ~ 0.007682 (0.6), 0.637949 ~ 0.006281 (0.4), 0.631836 ~ 0.004786 (0.2)
K-means tests summary
NMI: 0.174531 ~ 0.000000
ARI: 0.162560 ~ 0.000000

and below is the original result I got when I ran the code.

[original result]

SVM tests summary
Macro-F1: 0.601211 ~ 0.016628 (0.8), 0.602220 ~ 0.007302 (0.6), 0.598148 ~ 0.010544 (0.4), 0.591385 ~ 0.004841 (0.2)
Micro-F1: 0.600718 ~ 0.016712 (0.8), 0.602011 ~ 0.007650 (0.6), 0.598323 ~ 0.010250 (0.4), 0.591484 ~ 0.004161 (0.2)
K-means tests summary
NMI: 0.151366 ~ 0.000000
ARI: 0.159534 ~ 0.000000

You might already know about this approach, but I just wanted to let you know!

If you are interested, I'll provide the full code for your reference.

Thanks again for this wonderful paper and the implementation codes.

Mattias

Why is the edge from dst to src in DGL graph?

Hi~ I have a question about constructing the DGL graph.

edges.append((row_parsed[0], dst))

According to the preprocessing code above, in parse_adjlist(adjlist, edge_metapath_indices, samples=None), the elements of edges should be (src, dst) pairs. But when constructing the DGL graph, it seems that you add each edge from dst to src.

g.add_edges(*list(zip(*[(edges[i][1], edges[i][0]) for i in sorted_index])))

I don't quite understand this. Could you explain it? Thank you very much!
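For what it's worth, a minimal DGL sketch of the usual reason for the flip (an illustration of DGL's message-passing convention, not the author's confirmed rationale): messages flow from source to destination, so reversing each parsed (src, dst) pair makes src the receiver during aggregation.

import dgl
import dgl.function as fn
import torch

# Edges point neighbor -> target: nodes 1 and 2 send messages to node 0.
g = dgl.graph(([1, 2], [0, 0]), num_nodes=3)
g.ndata['h'] = torch.eye(3)
g.update_all(fn.copy_u('h', 'm'), fn.sum('m', 'h_agg'))
print(g.ndata['h_agg'][0])  # tensor([0., 1., 1.]): sum of neighbors' features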

How to perform link prediction (user-artist pair on Last.fm) for homogeneous methods like LINE and GCN?

Hi, we appreciate your work.
I am confused about how to perform link prediction with homogeneous methods like LINE, node2vec, GCN, and GAT. They can only learn homogeneous graphs, i.e. embeddings for one type of object, but the user-artist links in Last.fm are heterogeneous: they connect two different types of objects. Likewise, HAN is only able to learn embeddings for one type of target object.
Thanks.

Could you publish the code of the baseline?

Hi,
I am doing some tests on some baselines, such as GAT and GCN, on the preprocessed DBLP dataset.
I am using metapath APCPA to construct a metapath-based homogeneous graph and conduct a GAT layer on it with the hyperparameter settings reported in MAGNN.
But I got a very different result:

SVM test
Macro-F1: 0.938165 ~ 0.004259 (0.8), 0.937600 ~ 0.004730 (0.6), 0.937070 ~ 0.002117 (0.4), 0.933790 ~ 0.003152 (0.2)
Micro-F1: 0.942638 ~ 0.004183 (0.8), 0.942210 ~ 0.004119 (0.6), 0.941688 ~ 0.002020 (0.4), 0.938910 ~ 0.003140 (0.2)
K-means test
NMI: 0.778665 ~ 0.000000
ARI: 0.837312 ~ 0.000000

I think this is because I somehow introduced extra information that should not be involved, which inflates the result, so I wonder whether the code for the baselines is available.
I would really appreciate it if you could share the baseline code! 👍 :)
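In case it helps others reproducing this setup, here is one plausible way to build the APCPA metapath-based homogeneous graph from the raw adjacency matrices (the variable names, toy shapes, and binarization step are my assumptions, not the MAGNN baseline code):

import numpy as np
import scipy.sparse as sp

# Toy stand-ins for the author-paper and paper-conference adjacencies.
A_ap = sp.random(100, 300, density=0.02, format='csr', random_state=0)
A_pc = sp.random(300, 20, density=0.05, format='csr', random_state=1)

apc = A_ap @ A_pc                        # author -> conference via A-P-C
A_apcpa = apc @ apc.T                    # author -> author via A-P-C-P-A
A_apcpa.setdiag(0)                       # drop self-loops
A_apcpa = (A_apcpa > 0).astype(np.int8)  # binarize to an adjacency matrix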

What does etypes_list mean?

When I run run_DBLP.py, I am confused about what etypes_list = [[0, 1], [0, 2, 3, 1], [0, 4, 5, 1]] means.
I have seen the closed question #5, but I also want to know where in the code the edge type indices are defined.

About the graph embedding baselines

Thanks for sharing the code! I am very interested in your experimental setup. I ran into some problems implementing link prediction for the HERec method, so I was wondering if you could share your data preprocessing and model code for this task. Looking forward to your reply and tips!

Mask the positive train edges

mask = [False if [u1, a1 - offset] in exclude or [u2, a2 - offset] in exclude else True for u1, a1, u2, a2 in indices[:, [0, 1, -1, -2]]]

Why exclude the positive train edges in the function parse_minibatch_LastFM?
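One plausible reading (an inference, not the author's confirmed answer): when a positive training pair is the batch's prediction target, any metapath instance whose endpoint edge is exactly that pair would leak the label, so such instances are masked out. A toy miniature of the logic with hypothetical values:

import numpy as np

exclude = [[0, 5], [3, 7]]         # positive (user, artist) training pairs
indices = np.array([[0, 5, 2, 9],  # this instance starts with the edge (0, 5)
                    [1, 6, 4, 8]])
mask = [[u1, a1] not in exclude and [u2, a2] not in exclude
        for u1, a1, u2, a2 in indices[:, [0, 1, -1, -2]]]
print(mask)  # [False, True]: the first instance is excluded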

Refactoring the code

Hi, I wish to refactor the MAGNN code for my research project.
However, I found that the link prediction and node classification tasks use different models (MAGNN_lp, MAGNN_nc, MAGNN_nc_mb) in your implementation.
So if I wish to produce a generic MAGNN model in an unsupervised setting that supports multi-layer, mini-batch training over more than one type of edge (instead of just user-artist in LastFM), how could I do that based on your implementation? Could you please give me some directions?
More specifically, why did you set use_minibatch=False in MAGNN_nc? What would happen if I simply changed it to True?
Thanks!

Run link prediction on the LastFM dataset

How long does it take for python run_LastFM.py to run? And how do you set the parameters (are you using the default values, such as batch_size = 8, which seems very small)?

LastFM dataset preprocessing error

When I ran preprocess_LastFM.ipynb to process the dataset for link prediction, something went wrong.

[screenshot of the error omitted]

I couldn't find how to produce 'data/preprocessed/LastFM_processed/user_artist.npy' in the code. Please help me with this problem.

what is the meaning of 'use_masks' and 'no_masks'?

Thanks for open-sourcing the code.

I'm confused about the code in run_LastFM.py

MAGNN/run_LastFM.py, lines 21 to 23 at commit b8557f5:

use_masks = [[True, True, False],
             [True, False, True]]
no_masks = [[False] * 3, [False] * 3]

what is the meaning of 'use_masks' and 'no_masks'?

They are parameters of parse_adjlist_LastFM:

def parse_adjlist_LastFM(adjlist, edge_metapath_indices, samples=None, exclude=None, offset=None, mode=None):

Looking forward to your reply.
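One consistent reading, inferred from the metapath lists in run_LastFM.py rather than quoted from the author: a True entry marks a metapath that itself traverses a user-artist edge, so its instances must be masked during training to avoid leaking the very links being predicted, while validation and testing use no_masks because no training labels can leak there.

# Inferred correspondence (U = user, A = artist, T = tag):
expected_metapaths = [
    [(0, 1, 0), (0, 1, 2, 1, 0), (0, 0)],  # U-A-U, U-A-T-A-U, U-U
    [(1, 0, 1), (1, 2, 1), (1, 0, 0, 1)],  # A-U-A, A-T-A, A-U-U-A
]
use_masks = [[True, True, False],  # U-A-U and U-A-T-A-U contain a U-A edge; U-U does not
             [True, False, True]]  # A-U-A and A-U-U-A contain a U-A edge; A-T-A does not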

Question about data splitting

Hello, I have read your WWW '20 paper and your code recently. It's interesting work.
I currently have one question about the data split process in your paper.

You mentioned how you split the data for semi-supervised learning models.
[screenshot of the paper's data-split description omitted]
Q1: Why is there another "Train %" column in Table 3? Do you not follow the above-mentioned data split? Or do you use the above-mentioned semi-supervised setting to get the embeddings, and then use the training-rate settings in Table 3 to train another classifier? But in that case, could we still call it semi-supervised learning?
Q2: How do you run the GNN baselines, e.g., GCN and GAT? In Table 3 they also correspond to different training rates, which is confusing.

etype list

etypes_list = [[0, 1], [0, 2, 3, 1], [0, 4, 5, 1]]

Can you elaborate on what etypes_list signifies, and how you arrived at the list above? Thanks.
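For readers hitting the same question, one reading consistent with the DBLP metapaths APA, APTPA, and APCPA used in the paper (the exact integer-to-edge-type assignment below is an inference, not quoted from the repository):

# A = author, P = paper, T = term, C = conference; each integer is the
# type of one edge traversed along the metapath.
etypes_list = [
    [0, 1],        # APA:   A-P (0), P-A (1)
    [0, 2, 3, 1],  # APTPA: A-P (0), P-T (2), T-P (3), P-A (1)
    [0, 4, 5, 1],  # APCPA: A-P (0), P-C (4), C-P (5), P-A (1)
]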

Having trouble understanding the data format

Hi, I tried to dive into the source code of the load_DBLP_data() function and printed the intermediate variables. However, I still have trouble understanding the data format. Could you give some explanation? Thank you!

How to deal with nodes with no neighbor for a given metapath?

I hit this error when running on my own dataset:

Traceback (most recent call last):
  File "run_mydata.py", line 223, in <module>
    args.epoch, args.patience, args.batch_size, args.samples, args.repeat, args.save_postfix)
  File "run_mydata.py", line 96, in run_model_DBLP
    adjlists, edge_metapath_indices_list, train_idx_batch, device, neighbor_samples)
  File "/MAGNN-master/utils/tools.py", line 112, in parse_minibatch
    [adjlist[i] for i in idx_batch], [indices[i] for i in idx_batch], samples)
  File "/MAGNN-master/utils/tools.py", line 72, in parse_adjlist
    num = len(edge_metapath_indices[0][0])
IndexError: index 0 is out of bounds for axis 0 with size 0

I think it may be because the code assumes that every node has at least one neighbor along every metapath. Could you give me some advice on how to fix this?
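A possible workaround (hypothetical, not the author's fix): before parsing a batch, pad any node that has no instance for some metapath with a degenerate self-instance, so edge_metapath_indices is never empty.

import numpy as np

def pad_empty_instances(instances, node_id, metapath_len):
    # If a node has no metapath instances, fabricate one that repeats the
    # node itself, so downstream code like len(instances[0]) never fails.
    if len(instances) == 0:
        return np.full((1, metapath_len), node_id, dtype=np.int64)
    return instances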

Question about ID offsets

When processing the LastFM dataset, the user and artist IDs are repeatedly shifted by offsets during preprocessing and sampling. Why not encode users, artists, and the other nodes with a unified numbering, so that each node has a single unique ID throughout?

Loss - only supervised?

In run_DBLP.py, I see that the model is trained with only a supervised loss. Can you please confirm this, as the paper also mentions an unsupervised loss?

Last.fm training is too slow; how can I use a smaller Last.fm?

Training on Last.fm is very slow because of its size: one epoch takes about 2 hours, which makes it almost impossible for me to use grid search to find a better result.
It would help if I could use a smaller Last.fm, or a part of it. Is that possible?
I tried changing num_user = 1892, num_artist = 17632, num_tag = 11945 to smaller values in preprocess_LastFM.ipynb, but some errors occurred.

Can someone tell me whether there is an easy way to use a smaller Last.fm? Thanks a lot.
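One way to get a smaller Last.fm is to subsample the raw interaction file before running the notebook, rather than shrinking the counts afterwards. A rough sketch; the file name, separator, and column names follow the HetRec 2011 release and are assumptions about this repository's raw data:

import pandas as pd

# Keep a random subset of users, restrict rows to those users, then
# re-index both ID columns so they are contiguous again.
ua = pd.read_csv('data/raw/LastFM/user_artists.dat', sep='\t')
keep_users = ua['userID'].drop_duplicates().sample(n=500, random_state=0)
ua_small = ua[ua['userID'].isin(keep_users)].copy()
ua_small['userID'] = ua_small['userID'].astype('category').cat.codes
ua_small['artistID'] = ua_small['artistID'].astype('category').cat.codes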
