marfox / strephit

This project forked from wikidata/strephit


An intelligent reading agent that understands text and translates it into Wikidata statements.

Home Page: https://meta.wikimedia.org/wiki/Grants:IEG/StrepHit:_Wikidata_Statements_Validation_via_References

License: GNU General Public License v3.0

Languages: Python 96.81%, HTML 3.14%, JavaScript 0.05%


StrepHit

StrepHit is a Natural Language Processing pipeline that understands human language, extracts facts from text, and produces Wikidata statements with references.

StrepHit is an IEG (Individual Engagement Grant) project funded by the Wikimedia Foundation.

StrepHit will enhance the data quality of Wikidata by suggesting references to validate statements, and will help Wikidata become the gold-standard hub of the Open Data landscape.

Official Project Page

https://meta.wikimedia.org/wiki/Grants:IEG/StrepHit:_Wikidata_Statements_Validation_via_References

Documentation

https://www.mediawiki.org/wiki/StrepHit

Features

Pipeline

  1. Corpus Harvesting
  2. Corpus Analysis
  3. Sentence Extraction
  4. N-ary Relation Extraction
  5. Dataset Serialization
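The five stages above feed into one another, each consuming the previous stage's output. The sketch below illustrates that data flow; all function names and data shapes are hypothetical stand-ins, not StrepHit's actual module API:

```python
# Illustrative sketch only: these functions are hypothetical stand-ins for
# StrepHit's modules, showing how the five pipeline stages feed each other.

def harvest_corpus(sources):
    # 1. Corpus Harvesting: fetch raw documents from the web sources
    return [{"source": s, "text": "Chopin composed nocturnes."} for s in sources]

def analyze_corpus(documents):
    # 2. Corpus Analysis: POS-tag the text and rank meaningful verbs
    return {"documents": documents, "verbs": ["compose"]}

def extract_sentences(analysis):
    # 3. Sentence Extraction: keep sentences containing a ranked verb
    return [d["text"] for d in analysis["documents"]]

def extract_relations(sentences):
    # 4. N-ary Relation Extraction: classify the semantic roles in each sentence
    return [{"sentence": s, "roles": {"Creator": "Chopin"}} for s in sentences]

def serialize_dataset(relations):
    # 5. Dataset Serialization: emit QuickStatements-style rows
    # (placeholder identifiers, not real Wikidata IDs)
    return ["Q_subject\tP_property\tQ_value" for _ in relations]

def run_pipeline(sources):
    documents = harvest_corpus(sources)
    analysis = analyze_corpus(documents)
    sentences = extract_sentences(analysis)
    relations = extract_relations(sentences)
    return serialize_dataset(relations)
```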

Get Ready

  • Install Python 2.7 and pip
  • Clone the repository and create the output folder:
$ git clone https://github.com/Wikidata/StrepHit.git
$ mkdir StrepHit/output
  • Install all the Python requirements (preferably in a virtualenv):
$ cd StrepHit
$ pip install -r requirements.txt
  • Configure your Dandelion datatxt API credentials, used by the entity linking step:
NEX_URL = 'https://api.dandelion.eu/datatxt/nex/v1/'
NEX_TOKEN = 'your API token here'
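NEX_URL points at Dandelion's datatxt named-entity extraction (NEX) endpoint. The helper below is a hypothetical sketch of how a client might assemble a request to it; it only builds the URL, so no valid token or network access is needed:

```python
# Hypothetical helper showing how the NEX endpoint might be queried;
# it only builds the request URL, so no token or network access is required.
try:
    from urllib.parse import urlencode   # Python 3
except ImportError:
    from urllib import urlencode         # Python 2.7

NEX_URL = 'https://api.dandelion.eu/datatxt/nex/v1/'
NEX_TOKEN = 'your API token here'

def build_nex_request(text, token=NEX_TOKEN, url=NEX_URL):
    # datatxt/nex expects the text to annotate and the API token as parameters
    params = {'text': text, 'token': token}
    return url + '?' + urlencode(sorted(params.items()))

request_url = build_nex_request('Johann Sebastian Bach was a composer.')
# An actual call would then be e.g.: requests.get(request_url).json()
```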

Optional dependency

If you want to extract sentences via syntactic parsing, you will also need to download Stanford CoreNLP:

$ python -m strephit commons download stanford_corenlp

Command Line

You can run all the NLP pipeline components from the command line. Run a command with no arguments, or pass --help, to see its available options. Each command can expose a set of sub-commands, depending on its granularity.

$ python -m strephit                                                                             
Usage: __main__.py [OPTIONS] COMMAND [ARGS]...

Options:
  --log-level <TEXT CHOICE>...
  --cache-dir DIRECTORY
  --help                        Show this message and exit.

Commands:
  annotation          Corpus annotation via crowdsourcing
  classification      Roles classification
  commons             Common utilities used by others
  corpus_analysis     Corpus analysis module
  extraction          Data extraction from the corpus
  rule_based          Unsupervised fact extraction
  side_projects       Side projects scripts
  web_sources_corpus  Corpus retrieval from the web
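A command/sub-command layout like the one above is typically built from nested parsers. Here is a minimal stand-alone sketch using argparse (StrepHit's real CLI is built with a different library, but the nesting idea is the same; the commands shown are just two of those listed above):

```python
# Minimal argparse sketch of a command/sub-command CLI layout.
# This is an illustration, not StrepHit's actual entry point.
import argparse

def build_parser():
    parser = argparse.ArgumentParser(prog='strephit')
    # global options, shared by every command
    parser.add_argument('--log-level', default='INFO')
    parser.add_argument('--cache-dir', default=None)
    commands = parser.add_subparsers(dest='command')
    # each top-level command gets its own sub-commands
    extraction = commands.add_parser('extraction')
    extraction_sub = extraction.add_subparsers(dest='subcommand')
    extraction_sub.add_parser('extract_sentences')
    extraction_sub.add_parser('process_semistructured')
    return parser

args = build_parser().parse_args(['extraction', 'extract_sentences'])
```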

Get Started

  • Generate a dataset of Wikidata assertions (QuickStatements syntax) from semi-structured data in the corpus (this takes time and requires a good Internet connection):
$ python -m strephit extraction process_semistructured -p 1 samples/corpus.jsonlines
  • Produce a ranking of meaningful verbs:
$ python -m strephit commons pos_tag samples/corpus.jsonlines bio en
$ python -m strephit corpus_analysis rank_verbs output/pos_tagged.jsonlines bio en
$ python -m strephit extraction extract_sentences samples/corpus.jsonlines output/verbs.json en
$ python -m strephit commons entity_linking -p 1 output/sentences.jsonlines en
  • Extract facts with the rule-based classifier:
$ python -m strephit rule_based classify output/entity_linked.jsonlines samples/lexical_db.json en
  • Train the supervised classifier and extract facts:
$ python -m strephit annotation parse_results samples/crowdflower_results.csv
$ python -m strephit classification train output/training_set.jsonlines en
$ python -m strephit classification classify output/entity_linked.jsonlines output/classifier_model.pkl en
  • Serialize the supervised classification results into a dataset of Wikidata assertions (QuickStatements):
$ python -m strephit commons serialize -p 1 output/supervised_classified.jsonlines samples/lexical_db.json en

N.B.: you will find all the output files in the output folder.
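The intermediate files above use the JSON Lines format (one JSON object per line), so they can be inspected with a few lines of generic Python; the reader below is an illustration, not StrepHit's own loader:

```python
import json

def read_jsonlines(lines):
    # Each line of a .jsonlines stream is an independent JSON object;
    # blank lines are skipped.
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)

# Typical usage on a pipeline output file:
# with open('output/sentences.jsonlines') as f:
#     for sentence in read_jsonlines(f):
#         print(sentence)
```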

Note on Parallel Processing

By default, StrepHit uses as many processes as there are CPU cores on the machine where it runs. Pass the -p parameter to change this behavior; set -p 1 to disable parallel processing.
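This behavior can be sketched with Python's multiprocessing module; the function below is a generic illustration of the -p semantics, not StrepHit's internal implementation:

```python
# Generic illustration of the -p semantics, not StrepHit's internal code.
import multiprocessing

def parallel_map(function, items, processes=None):
    # processes=None mirrors the default: one worker per CPU core.
    if processes == 1:
        # -p 1: no parallelism, map in the current process
        return [function(item) for item in items]
    pool = multiprocessing.Pool(processes=processes)
    try:
        return pool.map(function, items)
    finally:
        pool.close()
        pool.join()
```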

License

The source code is under the terms of the GNU General Public License, version 3.
