cynricfu / MAGNN
Metapath Aggregated Graph Neural Network for Heterogeneous Graph Embedding
Thanks for sharing the code! When I use other nodes in the DBLP dataset during preprocessing to search for metapath instances, following operations like those in the preprocess_DBLP.ipynb file, it always requires a large amount of memory. Does the author know the reason for this?
I have a question about the etype lists in run_LastFM.py.
etypes_lists = [[[0, 1], [0, 2, 3, 1], [None]],
[[1, 0], [2, 3], [1, None, 0]]]
expected_metapaths = [
[(0, 1, 0), (0, 1, 2, 1, 0), (0, 0)],
[(1, 0, 1), (1, 2, 1), (1, 0, 0, 1)]
]
Why is the etype of (0, 0) None? Is it because links between nodes of the same type are not important? If the two end nodes of the links I want to predict have the same node type, should I assign a type to (0, 0)?
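Not being the author, one way to read these lists: each metapath over k node types traverses k - 1 edges, so each etype list has exactly one entry per hop. A toy consistency check (the list values mirror run_LastFM.py; the pairing logic is my assumption):

```python
# Pairing each expected metapath with its edge-type sequence (user = 0,
# artist = 1, tag = 2 in LastFM). A metapath with k node types has k - 1
# hops, so its etype list has k - 1 entries; (0, 0) is the single
# user-user hop, whose etype is None.
expected_metapaths = [(0, 1, 0), (0, 1, 2, 1, 0), (0, 0)]
etypes_lists = [[0, 1], [0, 2, 3, 1], [None]]

for mp, etypes in zip(expected_metapaths, etypes_lists):
    assert len(etypes) == len(mp) - 1  # one edge type per hop
```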
Hi, I have a question about the preprocessing in DBLP.
As I can see in preprocess_DBLP.ipynb, you use data/raw/DBLP/DBLP4057_GAT_with_idx.mat from HAN.
Where can I find that file? And how can I make sure the indices of that matrix correspond correctly to your data?
This file is used in preprocess_DBLP.ipynb and I can't get it from the referenced repo. The referenced repo https://github.com/Jhy1993/HAN has a link to the preprocessed DBLP data, but the link takes me to a Baidu page, and I'm not sure how to download DBLP4057_GAT_with_idx.mat from Baidu while in the US.
Thank you very much for sharing the code.
I have a question about the preprocessing for DBLP. I used preprocess_DBLP.ipynb to generate the DBLP_processed files, but it took a long time and ran out of memory while processing the data. The laptop I use has 16 GB of RAM.
I wonder whether my laptop's RAM is too low, or whether I made some other mistake in processing the data. How much RAM did you use to process the DBLP dataset?
Looking forward to your reply. Thank you.
In link prediction experiments, should parameters be shared in MAGNN_ctr_ntype_specific if the target nodes are of the same type?
I would like to ask where you got these three metadata files. Thank you very much!
Hi, I need to use your sampling code in preprocess.py, and I want to know how long it takes to sample all the metapath instances for the DBLP dataset. I have been running it on an adjM for a long time, so I'm a little worried that I did something wrong.
Hi, can you share your train-validation-test splits? Since the sklearn version used for preprocessing is not mentioned, my PRNG may give a different split.
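For what it's worth, one way to sidestep version-dependent PRNGs is to fix the split with a seeded NumPy generator (a sketch, not the repo's actual splitting procedure):

```python
import numpy as np

# A seeded generator yields the same permutation on every run and NumPy
# version, independent of whatever sklearn did internally.
rng = np.random.default_rng(42)
idx = rng.permutation(100)            # stand-in for 100 node/link indices
train, val, test = idx[:60], idx[60:80], idx[80:]
assert len(train) == 60 and len(val) == 20 and len(test) == 20
```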
Thanks for sharing the code. I would like to use the MAGNN graph embedding algorithm to get embedded representations of the nodes and then do follow-up work with other models. When I use base_MAGNN.py, do I also need to process the data into the form produced by your preprocessing? Could you please give me some ideas? Thank you!
Hi, what is the reason for the memory error in the process of debugging Last_FM preprocessing code?
I can't download your preprocessed DBLP datasets. Could you send them to me by email at [email protected]? Thank you so much!
Thank you for your work! I am building a system using MAGNN, so I would like to know how to preprocess the DBLP XML file into multiple matrices, including the feature matrices.
The paper explicitly says that the intermediate nodes along a metapath are taken into account, but I really don't understand how you account for these intermediate nodes.
What do etypes_lists and use_masks mean?
I have noticed that you talked about the way for unsupervised learning. Did you offer an unsupervised version in this repo?
What is the difference between MAGNN_nc_mb.py, MAGNN_nc.py, and MAGNN_lp.py?
Hi, I am recently studying graph neural networks, and was looking for some SOTA implementations using GNN.
From Papers with Code I got to know your work and was really impressed by your brilliant solutions to the problems in the previous literature.
Trying things on my own with your IMDB code, I came up with the idea of using a pre-trained embedding model to improve prediction performance,
so I used a pre-trained DistilBERT model to build the node features instead of count vectors, and below is the result I got.
[DistilBERT]
SVM tests summary
Macro-F1: 0.640466 ~ 0.014346 (0.8), 0.642629 ~ 0.007654 (0.6), 0.638584 ~ 0.005989 (0.4), 0.632471 ~ 0.004478 (0.2)
Micro-F1: 0.640230 ~ 0.014154 (0.8), 0.642098 ~ 0.007682 (0.6), 0.637949 ~ 0.006281 (0.4), 0.631836 ~ 0.004786 (0.2)
K-means tests summary
NMI: 0.174531 ~ 0.000000
ARI: 0.162560 ~ 0.000000
And below is the original result I got when I ran the unmodified code.
[original result]
SVM tests summary
Macro-F1: 0.601211 ~ 0.016628 (0.8), 0.602220 ~ 0.007302 (0.6), 0.598148 ~ 0.010544 (0.4), 0.591385 ~ 0.004841 (0.2)
Micro-F1: 0.600718 ~ 0.016712 (0.8), 0.602011 ~ 0.007650 (0.6), 0.598323 ~ 0.010250 (0.4), 0.591484 ~ 0.004161 (0.2)
K-means tests summary
NMI: 0.151366 ~ 0.000000
ARI: 0.159534 ~ 0.000000
You may already know about this approach, but I just wanted to let you know!
If you are interested, I'll provide the full code for your reference.
Thanks again for this wonderful paper and the implementation codes.
Mattias
I ran the IMDB code with the default parameters and repeat=10. The result is good, but not as good as in the paper. How can I reproduce the results reported in the paper?
Thank you so much for such a wonderful paper. But I don't understand how to apply a type-specific linear transformation for each node type, projecting the feature vectors into the same latent space, as in Equation 1 of the paper.
I would appreciate it if you could give me a hint or tell me the answer. Thank you very much.
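As I understand Eq. 1, every node type A gets its own matrix W_A that maps its raw features into a shared d-dimensional latent space. A NumPy sketch of that idea (dimensions are invented; the repo itself uses PyTorch linear layers):

```python
import numpy as np

rng = np.random.default_rng(0)
d_latent = 8
feat_dims = {"author": 334, "paper": 4231}     # hypothetical raw feature sizes

# One projection matrix per node type: W_A has shape (d_latent, d_A).
W = {t: rng.standard_normal((d_latent, d)) * 0.01 for t, d in feat_dims.items()}

x_author = rng.standard_normal((5, feat_dims["author"]))  # 5 author nodes
h_author = x_author @ W["author"].T            # now in the shared latent space
assert h_author.shape == (5, d_latent)
```

After this projection, nodes of every type live in the same 8-dimensional space, so later aggregation steps can mix them freely.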
If I want to run MAGNN on a new dataset, how should I set the etypes_list value, such as [[0, 1], [0, 2, 3, 1], [0, 4, 5, 1]] in run_DBLP.py?
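One way to think about it (my reading, not confirmed by the author): number every directed relation in your schema, then each metapath's etype list is just the sequence of relation ids along it. The DBLP ids below reproduce run_DBLP.py's lists under that assumption:

```python
# Assumed relation numbering for DBLP: each directed edge type gets an id.
relations = {
    ("author", "paper"): 0, ("paper", "author"): 1,
    ("paper", "term"): 2,   ("term", "paper"): 3,
    ("paper", "conf"): 4,   ("conf", "paper"): 5,
}

def metapath_to_etypes(node_types):
    # Read off the relation id of every consecutive hop along the metapath.
    return [relations[(a, b)] for a, b in zip(node_types, node_types[1:])]

assert metapath_to_etypes(["author", "paper", "author"]) == [0, 1]          # APA
assert metapath_to_etypes(
    ["author", "paper", "term", "paper", "author"]) == [0, 2, 3, 1]         # APTPA
assert metapath_to_etypes(
    ["author", "paper", "conf", "paper", "author"]) == [0, 4, 5, 1]         # APCPA
```

For a new dataset, you would enumerate your own directed relations the same way and read each metapath off hop by hop.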
Hi~ I have a question about constructing the DGL graph.
Line 97 in 144f39a
parse_adjlist(adjlist, edge_metapath_indices, samples=None)
, elements in edges[i] should be (src, dst). But when constructing the DGL graph, it seems that you add an edge from dst to src.
Line 116 in 144f39a
I don’t quite understand this. Could you tell me something about that? Thank you very much!
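My reading of the direction flip (a dependency-free toy, not the actual DGL code): message passing sends features along src → dst, so for the target node to receive from its metapath neighbors, the edges must point neighbor → target, which looks like "dst to src" relative to parse_adjlist's pairs:

```python
# Toy sum-aggregation along edge direction. pairs are (target, neighbor) as
# parse_adjlist might yield them; flipping them makes the target the
# destination, so it is the node that receives the messages.
pairs = [(0, 1), (0, 2), (1, 2)]
edges = [(nbr, tgt) for tgt, nbr in pairs]   # neighbor -> target

feat = {0: 1.0, 1: 10.0, 2: 100.0}
inbox = {}
for src, dst in edges:
    inbox[dst] = inbox.get(dst, 0.0) + feat[src]

assert inbox == {0: 110.0, 1: 100.0}         # node 0 hears from nodes 1 and 2
```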
Hi, we appreciate your work.
I am confused about how to perform link prediction for homogeneous methods like LINE, node2vec, GCN and GAT. They can only learn homogeneous graphs, i.e. they can only learn one type of object embedding, but the user-artist links in Last.fm are heterogeneous, which connect two different types of objects. As well, HAN is only able to learn the embeddings for one type of target objects.
Thanks.
Hi,
I am doing some tests on some baselines, such as GAT and GCN, on the preprocessed DBLP dataset.
I am using metapath APCPA
to construct a metapath-based homogeneous graph and conduct a GAT layer on it with the hyperparameter settings reported in MAGNN.
But I got a very different result:
SVM test
Macro-F1: 0.938165~0.004259 (0.8), 0.937600~0.004730 (0.6), 0.937070~0.002117 (0.4), 0.933790~0.003152 (0.2)
Micro-F1: 0.942638~0.004183 (0.8), 0.942210~0.004119 (0.6), 0.941688~0.002020 (0.4), 0.938910~0.003140 (0.2)
K-means test
NMI: 0.778665~0.000000
ARI: 0.837312~0.000000
I think this is because I somehow introduced extra information that should not be involved, which makes the result higher, so I wonder whether the code for the baselines is available?
I would really appreciate it if you could share the baseline code! 👍 :)
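For reference, one common way to build such a metapath-based homogeneous graph (a sketch under my own assumptions, not the authors' baseline code) is to chain the bipartite adjacency matrices along APCPA and binarize:

```python
import numpy as np

rng = np.random.default_rng(0)
A_ap = (rng.random((4, 6)) < 0.3).astype(int)   # toy author-paper adjacency
A_pc = (rng.random((6, 2)) < 0.5).astype(int)   # toy paper-conf adjacency

# Path counts for A-P-C-P-A, then binarize into a plain author-author graph
# that a homogeneous GCN/GAT can consume.
apcpa_counts = A_ap @ A_pc @ A_pc.T @ A_ap.T
adj = (apcpa_counts > 0).astype(int)
assert adj.shape == (4, 4)
```

Whether to keep the path counts as edge weights or binarize them is one of the choices that can make baseline numbers diverge.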
When I run run_DBLP.py, I am confused about what "etypes_list = [[0, 1], [0, 2, 3, 1], [0, 4, 5, 1]]" means.
I saw the closed question #5, but I also want to know where in the code the edge index numbers are defined.
Thanks for sharing the code! I am very interested in your experimental scheme. I encountered some problems implementing the HERec method's link prediction, so I was wondering if you could share your data preprocessing and model code for this task. Looking forward to your reply and tips!
Line 141 in 144f39a
I've never quite understood what this edge-list definition means. Which module in the paper does it correspond to?
Hi, I wish to refactor the MAGNN code for my research project.
However, I found that the Link Prediction and Node Classification Tasks used different models (MAGNN_lp, MAGNN_nc, MAGNN_nc_mb) in your implementation.
So if I wish to produce a generic MAGNN model in an unsupervised setting that supports multi-layer, mini-batch training over more than one type of edge (instead of just user-artist in LastFM), how could I do that based on your implementation? Could you please give me some directions?
More specifically, why did you set use_minibatch=False in MAGNN_nc? What would happen if I simply changed it to True?
Thanks!
How long does it take for python run_LastFM.py
to run? And how do you set the parameters (are you using the default values, such as batch_size = 8, which seems very small)?
Hello, I have read your WWW20 paper and your code recently. It's an interesting work.
I currently have one question about the data split process in your paper.
You mentioned how you split the data for semi-supervised learning models.
Q1: Why is there a "Train %" column in Table 3? Do you not follow the above-mentioned data split? Or do you use the above-mentioned semi-supervised setting to get the embeddings and then use the training rates in Table 3 to train another classifier? If so, can we still call it semi-supervised learning?
Q2: How do you run the GNN baselines, e.g., GCN and GAT? In Table 3 they also correspond to different training rates, which is confusing.
Line 18 in 144f39a
Can you elaborate on what etypes_list signifies and how you arrived at the following list? Thanks.
Or can you tell me how you processed the datasets?
Hi, I tried to dive into the source code of load_DBLP_data() function and print the intermediate variables. However, I still have trouble in understanding the data format. Could you give some explanation? Thank you!
I get the following error when I run my own dataset:
Traceback (most recent call last):
File "run_mydata.py", line 223, in <module>
args.epoch, args.patience, args.batch_size, args.samples, args.repeat, args.save_postfix)
File "run_mydata.py", line 96, in run_model_DBLP
adjlists, edge_metapath_indices_list, train_idx_batch, device, neighbor_samples)
File "/MAGNN-master/utils/tools.py", line 112, in parse_minibatch
[adjlist[i] for i in idx_batch], [indices[i] for i in idx_batch], samples)
File "/MAGNN-master/utils/tools.py", line 72, in parse_adjlist
num = len(edge_metapath_indices[0][0])
IndexError: index 0 is out of bounds for axis 0 with size 0
I think maybe it's because the code requires that every node on every metapath has neighbors. Could you give me some advice on how to fix this?
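A hypothetical guard for this (the names match the traceback, but the fix itself is my guess at a workaround): scan for the first node that actually has metapath instances instead of unconditionally indexing entry 0:

```python
import numpy as np

def metapath_width(edge_metapath_indices, default_width=None):
    # The original line `num = len(edge_metapath_indices[0][0])` crashes when
    # the first node's instance array is empty; fall back to any non-empty
    # per-node array, else to a caller-supplied default.
    for arr in edge_metapath_indices:
        if len(arr) > 0:
            return len(arr[0])
    return default_width

empty = np.empty((0, 3), dtype=int)           # a node with no instances
full = np.array([[1, 2, 3], [4, 5, 6]])       # a node with two instances
assert metapath_width([empty, full]) == 3
assert metapath_width([empty], default_width=3) == 3
```

Nodes with zero instances would still need to be skipped or padded downstream; this only avoids the immediate IndexError.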
When processing the Last.fm dataset, the user and artist indices are repeatedly offset back and forth during the processing and sampling pipelines. Why not encode users, artists, and the other nodes uniformly, so that each node keeps a single unique id from start to finish?
In run_DBLP.py, I see that the model is only trained with supervised loss. Can you please confirm this as the paper also mentions an unsupervised loss?
Last.fm is too slow to train on because of its size. It takes about 2 hours per epoch, which makes it almost impossible for me to use grid search to find a good or better result.
It would help if I could use a smaller Last.fm, or a part of it. Is that possible?
I tried changing num_user = 1892, num_artist = 17632, num_tag = 11945 to smaller values in preprocess_LastFM.ipynb, but errors occurred.
Can someone tell me an easy way to use a smaller Last.fm? Thanks a lot.
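A rough sketch of one subsampling approach (my own suggestion, untested against the notebook): keep only ids below a cutoff and drop every edge with a removed endpoint, so the surviving ids stay contiguous and no reindexing is needed:

```python
import numpy as np

user_artist = np.array([[0, 2], [1, 0], [3, 1]])  # toy (user, artist) edges
keep_users, keep_artists = 2, 2                   # keep only ids < 2

# An edge survives only if both endpoints survive; because we keep a prefix
# of each id range, the remaining ids are still contiguous from 0.
mask = (user_artist[:, 0] < keep_users) & (user_artist[:, 1] < keep_artists)
sub = user_artist[mask]
assert sub.tolist() == [[1, 0]]
```

The same filter would have to be applied consistently to the user-friend and artist-tag edge lists before the sampling steps, which is likely where naive changes to num_user etc. break.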