ecocor-extractor

Description

This repository hosts a script that extracts the frequencies of words from a given word list in text segments. The input is a JSON object, passed over an API, that contains the text segments with their IDs and, optionally, a URL to the word list on which the frequencies are based. The script returns a JSON array that gives, for each word, its frequency per text segment.

Formats

The keys in the input and in the word list must be named as follows:

  • Input Format: {"segments": [{"segment_id":"xyz", "text":"asd ..."}, ...], "language":"de", "word_list":{"url":"http://..."}}
  • WordList Format: [{"word":"abc", "wikidata_ID":"Q12345","category":"plant"}]
  • Output Format: [{"word":"xyz", "wikidata_ID":"Q12345","category":"plant", "overall_frequency":1234, "segment_frequencies":{segment_id:1234,...}}]
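
To make the three formats concrete, here is a minimal sketch that maps segments and a word list to the output shape. It is not the repository's implementation: the script itself relies on spaCy, whereas this sketch assumes a plain lowercase regex tokenizer with exact matching. The function name `count_frequencies`, the sample segments, and the choice to omit words that never occur are all assumptions made for illustration.

```python
import re
from collections import Counter


def count_frequencies(segments, word_list):
    """Count word-list hits per segment and return entries in the output format.

    Assumption: tokens are lowercase runs of word characters, matched exactly.
    """
    # Tokenize each segment once and keep one token counter per segment ID.
    counts = {
        seg["segment_id"]: Counter(re.findall(r"\w+", seg["text"].lower()))
        for seg in segments
    }
    result = []
    for entry in word_list:
        word = entry["word"].lower()
        per_segment = {
            seg_id: counter[word]
            for seg_id, counter in counts.items()
            if counter[word] > 0
        }
        if per_segment:  # assumption: words that never occur are omitted
            result.append({
                "word": entry["word"],
                "wikidata_ID": entry["wikidata_ID"],
                "category": entry["category"],
                "overall_frequency": sum(per_segment.values()),
                "segment_frequencies": per_segment,
            })
    return result


# Hypothetical sample data following the input and word list formats.
segments = [
    {"segment_id": "s1", "text": "Die Eiche und die Buche."},
    {"segment_id": "s2", "text": "Eine Eiche, noch eine Eiche."},
]
word_list = [
    {"word": "Eiche", "wikidata_ID": "Q12345", "category": "plant"},
]
print(count_frequencies(segments, word_list))
```

A lemmatizing tokenizer such as the one spaCy provides would also catch inflected forms, which exact matching misses.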

Requirements

This script requires spaCy and FastAPI to be installed. Additionally, the spaCy models for German and English must be downloaded: de_core_news_sm and en_core_web_sm.
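
One way to check these requirements before starting the service is sketched below. It assumes that the spaCy models are installed as regular Python packages under their model names, which is how spaCy's download command installs them. The helper `missing_requirements` is hypothetical and not part of the repository.

```python
import importlib.util

# Package names taken from the requirements above.
REQUIRED = ["spacy", "fastapi", "de_core_news_sm", "en_core_web_sm"]


def missing_requirements(names=REQUIRED):
    """Return the names that cannot be found as importable packages."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_requirements()
if missing:
    print("Missing:", ", ".join(missing))
else:
    print("All requirements found.")
```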

Test

The script was tested using uvicorn. Unit tests are provided in test/ and can be executed with:

    python -m unittest test/test_extractor.py

The following curl command can be used to post a JSON file to the extractor service:

    curl -X POST -H "Content-Type: application/json" 127.0.0.1:8000/extractor --data-binary @test/test.json

The items in the response should match test/result.json.
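
The same request can be sent from Python with only the standard library. The sketch below assumes the service is reachable at the address used in the curl command. The helpers `build_request` and `post_segments` and the sample payload are hypothetical. The payload leaves out the optional word_list key.

```python
import json
import urllib.request

# Address taken from the curl command; adjust host and port to your setup.
EXTRACTOR_URL = "http://127.0.0.1:8000/extractor"


def build_request(payload, url=EXTRACTOR_URL):
    """Build a POST request carrying the payload as a JSON body."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def post_segments(payload, url=EXTRACTOR_URL):
    """Send the payload and return the decoded JSON response."""
    with urllib.request.urlopen(build_request(payload, url)) as response:
        return json.loads(response.read().decode("utf-8"))


# Hypothetical payload following the input format.
payload = {
    "segments": [{"segment_id": "s1", "text": "Die Eiche und die Buche."}],
    "language": "de",
}
request = build_request(payload)
print(request.get_method(), request.full_url)
# With the service running: result = post_segments(payload)
```

Building the request separately from sending it makes it possible to inspect the body and headers without a running server.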

Contributors

danilsko, hsluytergaethje
