capasitore / neuraltripletranslation

This project forked from yueliu/neuraltripletranslation


Seq2RDF: end-to-end Neural Triple Translation

License: Apache License 2.0


Neural Triple Translation

An end-to-end tool for generating triples from natural language sentences. Given a natural language sentence, this code detects the entities, identifies the relationships between them, and generates a corresponding RDF triple aligned with the trained knowledge graph vocabularies. This repo contains the source code for the ISWC'18 paper Seq2RDF: An End-to-end Application for Deriving Triples from Natural Language Text.

A Demo video is available.
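To make the task concrete, here is a purely illustrative sketch of the kind of input/output pair the model deals with; the sentence and the DBpedia-style URIs are hypothetical examples, not output from this code:

```python
# Illustrative only: a Seq2RDF-style input/output pair.
# The decoder emits subject, relation and object symbols drawn from the
# knowledge-graph vocabulary instead of free-form text.
sentence = "Barack Obama was born in Honolulu."

# A KG-aligned triple for this sentence could look like:
triple = (
    "dbr:Barack_Obama",  # subject entity (DBpedia resource)
    "dbo:birthPlace",    # relation from the KG vocabulary
    "dbr:Honolulu",      # object entity
)

subject, relation, obj = triple
print(subject, relation, obj)
```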

If you use the code, please cite the following paper:

 @inproceedings{liu2018seq2rdf,
  author    = {Liu, Yue and Zhang, Tongtao and Liang, Zhicheng and Ji, Heng and McGuinness, Deborah L},
  title     = {Seq2RDF: An End-to-end Application for Deriving Triples from Natural
               Language Text},
  booktitle = {Proceedings of the {ISWC} 2018 Posters {\&} Demonstrations, Industry
               and Blue Sky Ideas Tracks co-located with 17th International Semantic
               Web Conference {(ISWC} 2018)},
  year      = {2018}
}

Preprocess

The preprocess folder contains the implementation of distant supervision and data cleaning, using NLP tools and regular expressions, for OWL files and plain-text files respectively.

Data

We ran Stanford NER on the Wiki-DBpedia training set to detect entity mentions, and performed distant supervision using DBpedia Spotlight to assign type labels:

  • ADE: Adverse drug events dataset
  • NYT (Riedel et al., 2011): 1.18M sentences sampled from 294K New York Times news articles. 395 sentences are manually annotated with 24 relation types and 47 entity types. (Download JSON)
  • Wiki-DB: the training corpus contains 51k sentences sampled from Wikipedia articles. The test data consists of 2k manually labeled sentences. It has 37 relation types and 46 entity types after filtering out numeric-value-related relations. (Download)
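The distant-supervision step above can be sketched roughly as follows. This is a simplified stand-in, the real pipeline relies on Stanford NER and DBpedia Spotlight rather than substring matching, and the toy KG triples here are hypothetical:

```python
# Simplified distant supervision: label a sentence with a KG triple when
# both the subject and object surface forms appear in the sentence.
# (The actual pipeline uses NER and entity linking, not substring match.)

def distant_supervision(sentences, kg_triples):
    labeled = []
    for sent in sentences:
        for subj, rel, obj in kg_triples:
            if subj in sent and obj in sent:
                labeled.append((sent, (subj, rel, obj)))
    return labeled

kg = [("Barack Obama", "birthPlace", "Honolulu")]  # toy KG
sents = [
    "Barack Obama was born in Honolulu, Hawaii.",
    "Honolulu is the capital of Hawaii.",
]
labeled = distant_supervision(sents, kg)
print(labeled)  # only the first sentence mentions both entities
```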

Dependencies

  • python 3
  • flask
  • tensorflow
  • json
  • numpy
  • pytorch

We used Anaconda for this work.

Hyperparameters

Detailed information for parameters and external resources

| Parameter | Value | Description |
|---|---|---|
| embedding dimension | 512 | randomized word embedding dimension |
| Word2Vec training corpus | January 1st, 2017 English Wikipedia dump | pretrained Word2Vec |
| Word2Vec window size | 5 | pretrained Word2Vec |
| Word2Vec dimension | 200 | pretrained Word2Vec |
| KG embedding dimension | 100 | TransE |
| size of hidden layers | 128 | all structures |
| dropout rate | 5% | |
| learning rate | 5% | Adam optimizer |

For our proposed approach, we randomly initialize a 512-dimensional vector for each token in the training sentences and triples, for both the encoder and the decoder, as well as for the Unknown, Start-of-Triple and Padding tokens. The hidden sizes of the LSTM and attention networks are all set to 128. The dropout rate is set to 5% (0.05), meaning each token in a training sentence has a 5% probability of being masked as the Unknown token. This dropout makes our approach more robust to out-of-vocabulary words in the test sentences.
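The Unknown-token dropout described above can be sketched like this; a toy stand-in for the actual training code, with illustrative token names:

```python
import random

UNK = "<unk>"          # the Unknown token mentioned above
DROPOUT_RATE = 0.05    # 5%, as in the paper

def mask_tokens(tokens, rate, rng):
    """Replace each token with UNK with the given probability."""
    return [UNK if rng.random() < rate else tok for tok in tokens]

rng = random.Random(0)
tokens = ["barack", "obama", "was", "born", "in", "honolulu"] * 100
masked = mask_tokens(tokens, DROPOUT_RATE, rng)
frac = masked.count(UNK) / len(masked)
print(f"fraction masked: {frac:.3f}")  # roughly 0.05 for a long sequence
```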

KG embeddings training

We learned the 100-dimensional embeddings from 2,497,196 DBpedia triples, covering 1,238,671 distinct entities and 3,359 relations, using the TransE model. The code is based on this Open-Source Package for Knowledge Embedding, modified to run in a non-GPU environment.
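The TransE intuition behind these embeddings, h + r ≈ t for a true triple (h, r, t), can be sketched as follows; the vectors here are toy random examples, not the actual learned DBpedia embeddings:

```python
import numpy as np

DIM = 100  # KG embedding dimension, as above
rng = np.random.default_rng(0)

def transe_distance(h, r, t):
    """L2 distance ||h + r - t||; smaller means a more plausible triple."""
    return float(np.linalg.norm(h + r - t))

h = rng.normal(size=DIM)                      # head entity embedding
r = rng.normal(size=DIM)                      # relation embedding
t_true = h + r + 0.01 * rng.normal(size=DIM)  # near-exact translation of h by r
t_rand = rng.normal(size=DIM)                 # unrelated random entity

print(transe_distance(h, r, t_true) < transe_distance(h, r, t_rand))  # True
```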

Default Run

Run main.py under ./baselines to run the Relation Extraction baseline on the Wiki-DBpedia dataset.

Run main.py to deploy the Flask web application.

Two required TensorFlow outputs, the .tfmodel.data- files and the .pkl file, are not uploaded here due to their size. You should be able to generate them by training.

Contact

The code base is a joint work by

  • Yue Liu and Tongtao Zhang for the implementation of distant supervision, data cleaning, and the Seq2Seq model
  • Zhicheng Liang for the implementation of relation extraction baselines and KG embedding training.

If you have any questions about usage, please contact us by opening an issue and we will be glad to help. Contributions are always welcome.
