
keqa_wsdm19's Introduction

Knowledge Graph Embedding Based Question Answering

Knowledge Graph Embedding Based Question Answering, WSDM 2019

Installation

  • Requirements
  1. fuzzywuzzy
  2. scikit-learn
  3. torchtext
  4. nltk
  5. pytorch
  6. numpy
  • Usage
  1. cd KEQA_WSDM19
  2. pip install -r requirements.txt
  3. sh main.sh
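
For reference, the requirements above correspond to a requirements.txt along these lines. The version pin is an assumption: the scripts use the legacy torchtext Field/TabularDataset API, which was removed in later torchtext releases.

fuzzywuzzy
scikit-learn
torchtext==0.6.0  # assumption: legacy Field/TabularDataset API used by the scripts
nltk
torch
numpy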

Reference in BibTeX:

@conference{Huang-etal19Knowledge,
  Title     = {Knowledge Graph Embedding Based Question Answering},
  Author    = {Xiao Huang and Jingyuan Zhang and Dingcheng Li and Ping Li},
  Booktitle = {ACM International Conference on Web Search and Data Mining},
  Year      = {2019}
}

keqa_wsdm19's People

Contributors

junnandong, xhuang31


keqa_wsdm19's Issues

An Issue about the Statistics of FB2M and FB5M

Hi, thank you very much for sharing the code and data! However, I found an issue with the FB2M and FB5M datasets shared at https://www.dropbox.com/s/9lxudhdfpfkihr1/data.zip.

The paper (https://arxiv.org/pdf/1506.02075.pdf) reports the statistics of FB2M and FB5M in Table 2. For FB2M it lists 2,150,604 entities, 6,701 relations, and 14,180,937 atomic facts, whereas your FB2M data contains 1,963,130 entities, 6,701 relations, and 14,174,246 atomic facts.

So the numbers of entities and triples in your data differ from the reported statistics (the issue appears in both FB2M and FB5M). I don't know whether you have also noticed this.

Please bear with me if I made a mistake. Many thanks!
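
For reference, here is a minimal sketch of how such statistics can be recomputed from a whitespace-separated triples file; the column order is an assumption, and this is not the authors' preprocessing script.

import sys

# Count distinct entities, distinct relations, and total facts in a
# triples file with one fact per line.
entities, relations, n_facts = set(), set(), 0
with open(sys.argv[1]) as f:  # e.g. the extracted FB2M file
    for line in f:
        items = line.split()
        if len(items) < 3:
            continue  # skip malformed lines
        entities.add(items[0])   # assumption: head first
        entities.add(items[-1])  # assumption: tail last
        relations.add(items[1])  # assumption: relation in the middle
        n_facts += 1
print(len(entities), 'entities,', len(relations), 'relations,', n_facts, 'facts')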

Issue with train_entity.py

I am replicating your work so that I can use it for question answering. I am using a GPU and get an error at for batch_idx, batch in enumerate(train_iter): in the file train_entity.py in the root directory.

The error states:
File "/home/User/.local/lib/python3.6/site-packages/torchtext/data/field.py", line 184, in numericalize
arr = self.tensor_type(arr)
ValueError: too many dimensions 'str'

Can you help me with this error, please?
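
Not an official fix, but in legacy torchtext this ValueError typically means a Field is asked to build a tensor from raw strings, either because use_vocab=False was set on a string column or because build_vocab was never called. A minimal sketch of a field setup that numericalizes correctly under the legacy API (the field and file names are assumptions modeled on the repo's scripts):

from torchtext import data

# Keep use_vocab=True (the default) for string columns; use_vocab=False
# is only valid for columns that are already numeric.
TEXT = data.Field(lower=True)   # question tokens
ED = data.Field()               # per-token entity-detection labels
train = data.TabularDataset(path='preprocess/dete_train.txt', format='tsv',
                            fields=[('text', TEXT), ('ed', ED)])
# Both fields need a vocabulary before an iterator can numericalize them.
TEXT.build_vocab(train)
ED.build_vocab(train)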

Prediction on a Custom Dataset

How can the above solution be leveraged to work for any custom dataset? That is, what pipeline needs to be followed in order to perform simple QA on a custom document?

The training of TransE

I didn't find the transE_emb.py file in your code, so I would like to ask how the initialization vectors of TransE are produced during training, or could you provide the transE_emb.py file? I also have a question about KEQA: the vector representation predicted by KEQA is not necessarily in the same vector space as the TransE embeddings, so when the Euclidean distance between the two is computed, could there be mismatches?
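
For reference, a minimal sketch of the standard TransE objective (head + relation ≈ tail, trained with a margin ranking loss against corrupted triples). This is the textbook formulation, not the missing transE_emb.py; the embedding dimension 250 is taken from the other issues below.

import torch
import torch.nn as nn

class TransE(nn.Module):
    # Standard TransE: score(h, r, t) = ||h + r - t||; lower is better.
    def __init__(self, n_entities, n_relations, dim=250):
        super().__init__()
        self.ent = nn.Embedding(n_entities, dim)
        self.rel = nn.Embedding(n_relations, dim)
        nn.init.xavier_uniform_(self.ent.weight)
        nn.init.xavier_uniform_(self.rel.weight)

    def score(self, h, r, t):
        return (self.ent(h) + self.rel(r) - self.ent(t)).norm(p=2, dim=-1)

    def forward(self, pos, neg, margin=1.0):
        # Margin ranking loss: true triples (pos) must score at least
        # `margin` lower than corrupted triples (neg).
        return torch.relu(margin + self.score(*pos) - self.score(*neg)).mean()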

IndexError: index out of range in self

When I run test_main.py, I get an error like this:
Traceback (most recent call last):
File "test_main.py", line 115, in
dete_result, question_list = entity_predict(dataset_iter=test_iter)
File "test_main.py", line 41, in entity_predict
answer = torch.max(model(data_batch), 1)[1].view(data_batch.ed.size())
File "C:\Users\DW.conda\envs\pytorch\lib\site-packages\torch\nn\modules\module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "C:\Code\KEQA_WSDM19-master\entity_detection.py", line 39, in forward
x = self.embed(text)
File "C:\Users\DW.conda\envs\pytorch\lib\site-packages\torch\nn\modules\module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "C:\Users\DW.conda\envs\pytorch\lib\site-packages\torch\nn\modules\sparse.py", line 162, in forward
self.norm_type, self.scale_grad_by_freq, self.sparse)
File "C:\Users\DW.conda\envs\pytorch\lib\site-packages\torch\nn\functional.py", line 2210, in embedding
return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
IndexError: index out of range in self

I wonder why this error occurs and what the solution is.
Thanks very much!
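
Not a confirmed cause, but this IndexError means some token id in a batch is greater than or equal to the embedding table's num_embeddings, which typically happens when the vocabulary built at test time differs from the one the model was trained with. A quick diagnostic sketch (the attribute names mirror the traceback; treat them as assumptions):

# Compare the checkpoint's embedding size with the test-time vocabulary;
# a mismatch reproduces "IndexError: index out of range in self".
print('embedding rows:', model.embed.num_embeddings)
print('vocab size:', len(TEXT.vocab))
# Any token id >= num_embeddings will crash torch.embedding:
max_index = data_batch.text.max().item()
assert max_index < model.embed.num_embeddings, f'token id {max_index} out of range'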

Reproducibility

Thanks for your work.

Using the script you provided, we can obtain the following graph: cleanedFB.txt.
Nevertheless, in your experiments you mention two graphs, and these are not provided along with the corresponding questions.
Could you please provide all the data required for the experiments?

Thank you again.

I have some confusions, can you help me?

1. I find that only part of the entities and triples of FB2M are used in KEQA, while the baselines may use the whole FB2M, so the numbers of candidate answers differ across methods. Can we compare the results directly like this?
2. Why are the final accuracy and the overall accuracy different in the test results? I think that if we know the head and the relation, the tail should be unique in SimpleQuestions.
Looking forward to your reply. Thank you!

The KG used in train_entity.py appears to be smaller

Hello,

Thank you for sharing your work. I found the link to KGembed.zip in your main.sh, which I believe contains the entity embeddings pre-trained by TransE. However, when debugging train_entity.py, it seems that entities_emb.bin contains only 647,639 entities, while there are supposed to be around 2 million entities in FB2M. Can I assume that you trimmed the original FB2M dataset and created a smaller subgraph that contains only 647,639 entities?

Thank you for your time.

issue in train_detection.py

https://github.com/xhuang31/KEQA_WSDM19/blob/ba89ecd95b9835b96d3241caee77f08cfd9ffa8d/train_detection.py#L40

Trying to run the training, it throws the error below:

python train_detection.py --entity_detection_mode LSTM --fix_embed --gpu 0

line =m.03byqr1 m.033th cvg.computer_videogame.cvg_genre
item = ['m.03byqr1', 'm.033th', 'cvg.computer_videogame.cvg_genre']
Traceback (most recent call last):
File "train_detection.py", line 42, in
tokens = items[6].split()
IndexError: list index out of range

I am not able to understand why index 6 of items is accessed.
Can you please explain why?
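
Not the authors' answer, but the traceback shows the crashing line has only three whitespace-separated fields while the code reads items[6], which suggests train_detection.py expects a different, longer preprocessed file at that point rather than the raw triples shown above. A defensive sketch that makes the failure explicit (the path variable and the expected field count of 7 are assumptions):

# Fail with a useful message instead of IndexError when a line does not
# have the expected number of fields; the line above has only 3.
with open(path) as f:
    for line_no, line in enumerate(f, 1):
        items = line.strip().split('\t')
        if len(items) < 7:
            raise ValueError(f'line {line_no} has {len(items)} fields, expected >= 7; '
                             'is this the file train_detection.py expects?')
        tokens = items[6].split()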

File Not Found Error.

Hello.

When I run main.sh, I get "FileNotFoundError: [Errno 2] No such file or directory: 'preprocess/dete_best_model.pt'" at test time. How should I get this file?

an error

Hello, when I was running the code, an error occurred:
File "train_detection.py", line 63, in
train = data.TabularDataset(path=os.path.join(args.output, 'dete_train.txt'), format='tsv', fields=[('text', TEXT), ('ed', ED)])
File "D:\software\Anaconda3\lib\site-packages\torchtext\utils.py", line 130, in unicode_csv_reader
csv.field_size_limit(sys.maxsize)
OverflowError: Python int too large to convert to C long
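
Not repo-specific, but this is a known torchtext-on-Windows problem: sys.maxsize (2^63 - 1) does not fit into the 32-bit C long used on Windows, so csv.field_size_limit(sys.maxsize) overflows. A common workaround, run before the TabularDataset is created, is to cap the limit at the largest value the platform accepts (this patches a standard-library setting, not the repo's code):

import csv
import sys

# Decrease the limit until the platform's C long can hold it.
max_int = sys.maxsize
while True:
    try:
        csv.field_size_limit(max_int)
        break
    except OverflowError:
        max_int //= 10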

Incomplete support for CPU

There is a bug when trying to run on the CPU with --no_cuda:

When --no_cuda is set, args.gpu is -1. But later, when data.Iterator is called, args.gpu is passed into torch.device() without checking, so torch.device() raises an error saying the device ordinal cannot be negative, since args.gpu == -1.

This bug exists in train_detection.py, train_entity.py, train_pred.py and test_main.py.
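
A minimal sketch of the kind of guard that would fix it (not the authors' patch; args, train, and the batch size are assumptions taken from the report above):

import torch
from torchtext import data

# Map the repo's gpu argument to a valid torch.device: -1 (set by
# --no_cuda) must become 'cpu' instead of reaching torch.device directly.
device = torch.device('cpu' if args.gpu < 0 else f'cuda:{args.gpu}')
train_iter = data.Iterator(train, batch_size=args.batch_size, device=device)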

train_entity.py: head-entity representation learning only reaches 63% dev accuracy

After training for many epochs, the dev set accuracy is only about 63%.
The original paper does not report the training performance of this part; did the authors observe the same?
I feel that an accuracy this low must hurt the downstream tasks.
Dev Accuracy: 0.6357314148681055
411 17 40001 2049/2372 86% 0.003043
431 18 42001 1677/2372 71% 0.003203
451 19 44001 1305/2372 55% 0.002822
471 20 46001 933/2372 39% 0.003007
490 21 48001 561/2372 24% 0.003182
Dev Accuracy: 0.6398081534772182
513 22 50001 189/2372 8% 0.003154
533 22 52001 2189/2372 92% 0.002966
552 23 54001 1817/2372 77% 0.002975
572 24 56001 1445/2372 61% 0.002783
592 25 58001 1073/2372 45% 0.003067
Dev Accuracy: 0.6352517985611511
615 26 60001 701/2372 30% 0.002484
634 27 62001 329/2372 14% 0.002541
654 27 64001 2329/2372 98% 0.002770
674 28 66001 1957/2372 83% 0.002673
694 29 68001 1585/2372 67% 0.003463
Dev Accuracy: 0.6352517985611511
716 30 70001 1213/2372 51% 0.002752
736 31 72001 841/2372 35% 0.002837
756 32 74001 469/2372 20% 0.002801
776 33 76001 97/2372 4% 0.002722
795 33 78001 2097/2372 88% 0.002720
Dev Accuracy: 0.6302158273381295
818 34 80001 1725/2372 73% 0.002518
838 35 82001 1353/2372 57% 0.002832
858 36 84001 981/2372 41% 0.002269
878 37 86001 609/2372 26% 0.002469
898 38 88001 237/2372 10% 0.002821

ValueError: cannot reshape array of size 161914250 into shape (647639,250)

Entity representation learning...
Traceback (most recent call last):
File "train_entity.py", line 63, in
entities_emb = np.fromfile(os.path.join(args.output, 'entities_emb.bin'), dtype=np.float32).reshape((len(mid_dic), args.embed_dim))
ValueError: cannot reshape array of size 161914250 into shape (647639,250)
I checked entity2id.txt and it contains only 647,639 entries,
while entities_emb.bin holds 161,914,250 floats after loading (161,914,250 / 250 = 647,657).
How can this be resolved?
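
Not an official fix, but the arithmetic above points to an 18-row mismatch between entity2id.txt (647,639 entries) and entities_emb.bin (647,657 rows), i.e. the two files likely come from different preprocessing runs. A sketch that surfaces the mismatch instead of crashing (names and paths mirror the traceback; args.embed_dim is 250):

import os
import numpy as np

emb = np.fromfile(os.path.join(args.output, 'entities_emb.bin'), dtype=np.float32)
n_rows = emb.size // args.embed_dim  # 161914250 / 250 = 647657
print(f'embedding rows: {n_rows}, entity2id.txt entries: {len(mid_dic)}')
# Reshape by the file's own row count, then regenerate whichever file is
# stale so that both come from the same preprocessing run.
entities_emb = emb.reshape(n_rows, args.embed_dim)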

Test on FB2M

Hi! Thanks for sharing the code. I would like to ask how to test the model on FB2M.
