
keqa_wsdm19's Introduction

Knowledge Graph Embedding Based Question Answering

Knowledge Graph Embedding Based Question Answering, WSDM 2019

Installation

  • Requirements
  1. fuzzywuzzy
  2. scikit-learn
  3. torchtext
  4. nltk
  5. pytorch
  6. numpy
  • Usage
  1. cd KEQA_WSDM19
  2. pip install -r requirements.txt
  3. sh main.sh
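
For reference, the requirements above correspond to a requirements.txt along these lines. The version pin is an assumption: the scripts use the legacy torchtext Field/TabularDataset API, which was removed in later torchtext releases.

fuzzywuzzy
scikit-learn
torchtext==0.6.0  # assumption: legacy Field/TabularDataset API used by the scripts
nltk
torch
numpy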

Reference in BibTeX:

@conference{Huang-etal19Knowledge,
  Title     = {Knowledge Graph Embedding Based Question Answering},
  Author    = {Xiao Huang and Jingyuan Zhang and Dingcheng Li and Ping Li},
  Booktitle = {ACM International Conference on Web Search and Data Mining},
  Year      = {2019}
}

keqa_wsdm19's People

Contributors

junnandong, xhuang31


keqa_wsdm19's Issues

An Issue about the Statistics of FB2M and FB5M

Hi, thank you very much for sharing the code and data! However, I found an issue with the FB2M and FB5M datasets shared at https://www.dropbox.com/s/9lxudhdfpfkihr1/data.zip.

The paper (https://arxiv.org/pdf/1506.02075.pdf) reports the statistics of FB2M and FB5M in Table 2. For FB2M it lists 2,150,604 entities, 6,701 relations, and 14,180,937 atomic facts, whereas your FB2M data contains 1,963,130 entities, 6,701 relations, and 14,174,246 atomic facts.

So the numbers of entities and triples in your data differ from the reported statistics (the issue appears in both FB2M and FB5M). I don't know whether you have also noticed this.

Please bear with me if I made a mistake. Many thanks!
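
For reference, here is a minimal sketch of how such statistics can be recomputed from a whitespace-separated triples file; the column order is an assumption, and this is not the authors' preprocessing script.

import sys

# Count distinct entities, distinct relations, and total facts in a
# triples file with one fact per line.
entities, relations, n_facts = set(), set(), 0
with open(sys.argv[1]) as f:  # e.g. the extracted FB2M file
    for line in f:
        items = line.split()
        if len(items) < 3:
            continue  # skip malformed lines
        entities.add(items[0])   # assumption: head first
        entities.add(items[-1])  # assumption: tail last
        relations.add(items[1])  # assumption: relation in the middle
        n_facts += 1
print(len(entities), 'entities,', len(relations), 'relations,', n_facts, 'facts')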

Issue with train_entity.py

I am replicating your work so that I can use it for question answering. I am using a GPU and get an error at for batch_idx, batch in enumerate(train_iter): in the file train_entity.py in the root directory.

The error states:
File "/home/User/.local/lib/python3.6/site-packages/torchtext/data/field.py", line 184, in numericalize
arr = self.tensor_type(arr)
ValueError: too many dimensions 'str'

Can you help me with this error, please?
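
Not an official fix, but in legacy torchtext this ValueError typically means a Field is asked to build a tensor from raw strings, either because use_vocab=False was set on a string column or because build_vocab was never called. A minimal sketch of a field setup that numericalizes correctly under the legacy API (the field and file names are assumptions modeled on the repo's scripts):

from torchtext import data

# Keep use_vocab=True (the default) for string columns; use_vocab=False
# is only valid for columns that are already numeric.
TEXT = data.Field(lower=True)   # question tokens
ED = data.Field()               # per-token entity-detection labels
train = data.TabularDataset(path='preprocess/dete_train.txt', format='tsv',
                            fields=[('text', TEXT), ('ed', ED)])
# Both fields need a vocabulary before an iterator can numericalize them.
TEXT.build_vocab(train)
ED.build_vocab(train)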

Prediction on a Custom Dataset

How can the above solution be leveraged to work for any custom dataset? That is, what pipeline needs to be followed in order to perform simple QA on a custom document?

The training of TransE

I didn't find the transE_emb.py file in your code, so I would like to ask how the initialization vectors of TransE are produced during training, or could you provide the transE_emb.py file? I also have a question about KEQA: the vector representation predicted by KEQA is not necessarily in the same vector space as the TransE embeddings, so when the Euclidean distance between the two is computed, could there be mismatches?
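
For reference, a minimal sketch of the standard TransE objective (head + relation ≈ tail, trained with a margin ranking loss against corrupted triples). This is the textbook formulation, not the missing transE_emb.py; the embedding dimension 250 is taken from the other issues below.

import torch
import torch.nn as nn

class TransE(nn.Module):
    # Standard TransE: score(h, r, t) = ||h + r - t||; lower is better.
    def __init__(self, n_entities, n_relations, dim=250):
        super().__init__()
        self.ent = nn.Embedding(n_entities, dim)
        self.rel = nn.Embedding(n_relations, dim)
        nn.init.xavier_uniform_(self.ent.weight)
        nn.init.xavier_uniform_(self.rel.weight)

    def score(self, h, r, t):
        return (self.ent(h) + self.rel(r) - self.ent(t)).norm(p=2, dim=-1)

    def forward(self, pos, neg, margin=1.0):
        # Margin ranking loss: true triples (pos) must score at least
        # `margin` lower than corrupted triples (neg).
        return torch.relu(margin + self.score(*pos) - self.score(*neg)).mean()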

IndexError: index out of range in self

When I run test_main.py, I get an error like this:
Traceback (most recent call last):
File "test_main.py", line 115, in
dete_result, question_list = entity_predict(dataset_iter=test_iter)
File "test_main.py", line 41, in entity_predict
answer = torch.max(model(data_batch), 1)[1].view(data_batch.ed.size())
File "C:\Users\DW.conda\envs\pytorch\lib\site-packages\torch\nn\modules\module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "C:\Code\KEQA_WSDM19-master\entity_detection.py", line 39, in forward
x = self.embed(text)
File "C:\Users\DW.conda\envs\pytorch\lib\site-packages\torch\nn\modules\module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "C:\Users\DW.conda\envs\pytorch\lib\site-packages\torch\nn\modules\sparse.py", line 162, in forward
self.norm_type, self.scale_grad_by_freq, self.sparse)
File "C:\Users\DW.conda\envs\pytorch\lib\site-packages\torch\nn\functional.py", line 2210, in embedding
return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
IndexError: index out of range in self

I wonder why this error occurs and what the solution is.
Thanks very much!
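
Not a confirmed cause, but this IndexError means some token id in a batch is greater than or equal to the embedding table's num_embeddings, which typically happens when the vocabulary built at test time differs from the one the model was trained with. A quick diagnostic sketch (the attribute names mirror the traceback; treat them as assumptions):

# Compare the checkpoint's embedding size with the test-time vocabulary;
# a mismatch reproduces "IndexError: index out of range in self".
print('embedding rows:', model.embed.num_embeddings)
print('vocab size:', len(TEXT.vocab))
# Any token id >= num_embeddings will crash torch.embedding:
max_index = data_batch.text.max().item()
assert max_index < model.embed.num_embeddings, f'token id {max_index} out of range'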

Reproducibility

Thanks for your work.

Using the script you provided, we can obtain the following graph: cleanedFB.txt.
Nevertheless, in your experiments you mention two graphs, and these are not provided along with the corresponding questions.
Could you please provide all the data required for the experiments?

Thank you again.

I have some confusions, can you help me?

1. I find that only part of the entities and triples of FB2M are used in KEQA, while the baselines may use the whole FB2M, so the numbers of candidate answers differ across methods. Can we compare the results directly like this?
2. Why are the final accuracy and the overall accuracy different in the test results? I think that if we know the head and the relation, the tail should be unique in SimpleQuestions.
Looking forward to your reply. Thank you!

The KG used in train_entity.py appears to be smaller

Hello,

Thank you for sharing your work. I found the link to KGembed.zip in your main.sh, which I believe contains the entity embeddings pre-trained by TransE. However, when debugging train_entity.py, it seems that entities_emb.bin contains only 647,639 entities, while there are supposed to be around 2 million entities in FB2M. Can I assume that you trimmed the original FB2M dataset and created a smaller subgraph that contains only 647,639 entities?

Thank you for your time.

issue in train_detection.py

https://github.com/xhuang31/KEQA_WSDM19/blob/ba89ecd95b9835b96d3241caee77f08cfd9ffa8d/train_detection.py#L40

Trying to run the training, it throws the error below:

python train_detection.py --entity_detection_mode LSTM --fix_embed --gpu 0

line =m.03byqr1 m.033th cvg.computer_videogame.cvg_genre
item = ['m.03byqr1', 'm.033th', 'cvg.computer_videogame.cvg_genre']
Traceback (most recent call last):
File "train_detection.py", line 42, in
tokens = items[6].split()
IndexError: list index out of range

I am not able to understand why index 6 of items is accessed.
Can you please explain why?
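
Not the authors' answer, but the traceback shows the crashing line has only three whitespace-separated fields while the code reads items[6], which suggests train_detection.py expects a different, longer preprocessed file at that point rather than the raw triples shown above. A defensive sketch that makes the failure explicit (the path variable and the expected field count of 7 are assumptions):

# Fail with a useful message instead of IndexError when a line does not
# have the expected number of fields; the line above has only 3.
with open(path) as f:
    for line_no, line in enumerate(f, 1):
        items = line.strip().split('\t')
        if len(items) < 7:
            raise ValueError(f'line {line_no} has {len(items)} fields, expected >= 7; '
                             'is this the file train_detection.py expects?')
        tokens = items[6].split()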

File Not Found Error.

Hello.

When I run main.sh, I get "FileNotFoundError: [Errno 2] No such file or directory: 'preprocess/dete_best_model.pt'" at test time. How should I get this file?

an error

Hello, when I was running the code, an error occurred:
File "train_detection.py", line 63, in
train = data.TabularDataset(path=os.path.join(args.output, 'dete_train.txt'), format='tsv', fields=[('text', TEXT), ('ed', ED)])
File "D:\software\Anaconda3\lib\site-packages\torchtext\utils.py", line 130, in unicode_csv_reader
csv.field_size_limit(sys.maxsize)
OverflowError: Python int too large to convert to C long
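
Not repo-specific, but this is a known torchtext-on-Windows problem: sys.maxsize (2^63 - 1) does not fit into the 32-bit C long used on Windows, so csv.field_size_limit(sys.maxsize) overflows. A common workaround, run before the TabularDataset is created, is to cap the limit at the largest value the platform accepts (this patches a standard-library setting, not the repo's code):

import csv
import sys

# Decrease the limit until the platform's C long can hold it.
max_int = sys.maxsize
while True:
    try:
        csv.field_size_limit(max_int)
        break
    except OverflowError:
        max_int //= 10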

Incomplete support for CPU

There is a bug when trying to run on the CPU with --no_cuda:

When --no_cuda is set, args.gpu is -1. But later, when data.Iterator is called, args.gpu is passed into torch.device() without checking, so torch.device() raises an error saying the device ordinal cannot be negative, since args.gpu == -1.

This bug exists in train_detection.py, train_entity.py, train_pred.py and test_main.py.
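
A minimal sketch of the kind of guard that would fix it (not the authors' patch; args, train, and the batch size are assumptions taken from the report above):

import torch
from torchtext import data

# Map the repo's gpu argument to a valid torch.device: -1 (set by
# --no_cuda) must become 'cpu' instead of reaching torch.device directly.
device = torch.device('cpu' if args.gpu < 0 else f'cuda:{args.gpu}')
train_iter = data.Iterator(train, batch_size=args.batch_size, device=device)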

train_entity.py: head-entity representation learning only reaches 63% dev accuracy

After training for many epochs, the dev set accuracy is only about 63%.
The original paper does not report the training performance of this part; did the authors observe the same?
I feel that an accuracy this low must hurt the downstream tasks.
Dev Accuracy: 0.6357314148681055
411 17 40001 2049/2372 86% 0.003043
431 18 42001 1677/2372 71% 0.003203
451 19 44001 1305/2372 55% 0.002822
471 20 46001 933/2372 39% 0.003007
490 21 48001 561/2372 24% 0.003182
Dev Accuracy: 0.6398081534772182
513 22 50001 189/2372 8% 0.003154
533 22 52001 2189/2372 92% 0.002966
552 23 54001 1817/2372 77% 0.002975
572 24 56001 1445/2372 61% 0.002783
592 25 58001 1073/2372 45% 0.003067
Dev Accuracy: 0.6352517985611511
615 26 60001 701/2372 30% 0.002484
634 27 62001 329/2372 14% 0.002541
654 27 64001 2329/2372 98% 0.002770
674 28 66001 1957/2372 83% 0.002673
694 29 68001 1585/2372 67% 0.003463
Dev Accuracy: 0.6352517985611511
716 30 70001 1213/2372 51% 0.002752
736 31 72001 841/2372 35% 0.002837
756 32 74001 469/2372 20% 0.002801
776 33 76001 97/2372 4% 0.002722
795 33 78001 2097/2372 88% 0.002720
Dev Accuracy: 0.6302158273381295
818 34 80001 1725/2372 73% 0.002518
838 35 82001 1353/2372 57% 0.002832
858 36 84001 981/2372 41% 0.002269
878 37 86001 609/2372 26% 0.002469
898 38 88001 237/2372 10% 0.002821

ValueError: cannot reshape array of size 161914250 into shape (647639,250)

Entity representation learning...
Traceback (most recent call last):
File "train_entity.py", line 63, in
entities_emb = np.fromfile(os.path.join(args.output, 'entities_emb.bin'), dtype=np.float32).reshape((len(mid_dic), args.embed_dim))
ValueError: cannot reshape array of size 161914250 into shape (647639,250)
I checked entity2id.txt and it contains only 647,639 entries,
while entities_emb.bin holds 161,914,250 floats after loading (161,914,250 / 250 = 647,657).
How can this be resolved?
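
Not an official fix, but the arithmetic above points to an 18-row mismatch between entity2id.txt (647,639 entries) and entities_emb.bin (647,657 rows), i.e. the two files likely come from different preprocessing runs. A sketch that surfaces the mismatch instead of crashing (names and paths mirror the traceback; args.embed_dim is 250):

import os
import numpy as np

emb = np.fromfile(os.path.join(args.output, 'entities_emb.bin'), dtype=np.float32)
n_rows = emb.size // args.embed_dim  # 161914250 / 250 = 647657
print(f'embedding rows: {n_rows}, entity2id.txt entries: {len(mid_dic)}')
# Reshape by the file's own row count, then regenerate whichever file is
# stale so that both come from the same preprocessing run.
entities_emb = emb.reshape(n_rows, args.embed_dim)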

Test on FB2M

Hi! Thanks for sharing the code. I would like to ask how to test the model on FB2M.
